-
BLAB: Brutally Long Audio Bench
Authors:
Orevaoghene Ahia,
Martijn Bartelds,
Kabir Ahuja,
Hila Gonen,
Valentin Hofmann,
Siddhant Arora,
Shuyue Stella Li,
Vishal Puttagunta,
Mofetoluwa Adeyemi,
Charishma Buchireddy,
Ben Walls,
Noah Bennett,
Shinji Watanabe,
Noah A. Smith,
Yulia Tsvetkov,
Sachin Kumar
Abstract:
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite…
▽ More
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.
△ Less
Submitted 12 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Authors:
Rui Xin,
Niloofar Mireshghallah,
Shuyue Stella Li,
Michael Duan,
Hyunwoo Kim,
Yejin Choi,
Yulia Tsvetkov,
Sewoong Oh,
Pang Wei Koh
Abstract:
Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often only assessed by measuring the leakage of explicit identifiers but ignoring nuanced textual markers that can lead to re-identification. We challenge the above illus…
▽ More
Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often only assessed by measuring the leakage of explicit identifiers but ignoring nuanced textual markers that can lead to re-identification. We challenge the above illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information -- such as routine social activities -- can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74\% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a \textit{false sense of privacy}, highlighting the need for more robust methods that protect against semantic-level information leakage.
△ Less
Submitted 2 May, 2025; v1 submitted 27 April, 2025;
originally announced April 2025.
-
Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Authors:
Yubin Kim,
Hyewon Jeong,
Shan Chen,
Shuyue Stella Li,
Mingyu Lu,
Kumail Alhamoud,
Jimin Mun,
Cristina Grau,
Minseok Jung,
Rodrigo Gameiro,
Lizhou Fan,
Eugene Park,
Tristan Lin,
Joonsik Yoon,
Wonjin Yoon,
Maarten Sap,
Yulia Tsvetkov,
Paul Liang,
Xuhai Xu,
Xin Liu,
Daniel McDuff,
Hyeonhoon Lee,
Hae Won Park,
Samir Tulebaev,
Cynthia Breazeal
Abstract:
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examine…
▽ More
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical hallucination.
△ Less
Submitted 25 February, 2025;
originally announced March 2025.
-
Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning
Authors:
Shuyue Stella Li,
Jimin Mun,
Faeze Brahman,
Jonathan S. Ilgen,
Yulia Tsvetkov,
Maarten Sap
Abstract:
Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decisionmaking. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesi…
▽ More
Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decisionmaking. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Authors:
DeepSeek-AI,
Daya Guo,
Dejian Yang,
Haowei Zhang,
Junxiao Song,
Ruoyu Zhang,
Runxin Xu,
Qihao Zhu,
Shirong Ma,
Peiyi Wang,
Xiao Bi,
Xiaokang Zhang,
Xingkai Yu,
Yu Wu,
Z. F. Wu,
Zhibin Gou,
Zhihong Shao,
Zhuoshu Li,
Ziyi Gao,
Aixin Liu,
Bing Xue,
Bingxuan Wang,
Bochao Wu,
Bei Feng,
Chengda Lu
, et al. (175 additional authors not shown)
Abstract:
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters…
▽ More
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
DeepSeek-V3 Technical Report
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bing Xue,
Bingxuan Wang,
Bochao Wu,
Chengda Lu,
Chenggang Zhao,
Chengqi Deng,
Chenyu Zhang,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fucong Dai,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Han Bao
, et al. (175 additional authors not shown)
Abstract:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for loa…
▽ More
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
△ Less
Submitted 18 February, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs
Authors:
Yu Ying Chiu,
Liwei Jiang,
Bill Yuchen Lin,
Chan Young Park,
Shuyue Stella Li,
Sahithya Ravi,
Mehar Bhatia,
Maria Antoniak,
Yulia Tsvetkov,
Vered Shwartz,
Yejin Choi
Abstract:
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global reg…
▽ More
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including the underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquettes. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard which share the same questions but asked differently. We find that LLMs are sensitive to such difference in setups (e.g., GPT-4o with 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperform other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions
Authors:
Chan Young Park,
Shuyue Stella Li,
Hayoung Jung,
Svitlana Volkova,
Tanushree Mitra,
David Jurgens,
Yulia Tsvetkov
Abstract:
This study introduces ValueScope, a framework leveraging language models to quantify social norms and values within online communities, grounded in social science perspectives on normative structures. We employ ValueScope to dissect and analyze linguistic and stylistic expressions across 13 Reddit communities categorized under gender, politics, science, and finance. Our analysis provides a quantit…
▽ More
This study introduces ValueScope, a framework leveraging language models to quantify social norms and values within online communities, grounded in social science perspectives on normative structures. We employ ValueScope to dissect and analyze linguistic and stylistic expressions across 13 Reddit communities categorized under gender, politics, science, and finance. Our analysis provides a quantitative foundation showing that even closely related communities exhibit remarkably diverse norms. This diversity supports existing theories and adds a new dimension--community preference--to understanding community interactions. ValueScope not only delineates differing social norms among communities but also effectively traces their evolution and the influence of significant external events like the U.S. presidential elections and the emergence of new sub-communities. The framework thus highlights the pivotal role of social norms in shaping online interactions, presenting a substantial advance in both the theory and application of social norm studies in digital spaces.
△ Less
Submitted 7 October, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Teaching LLMs to Abstain across Languages via Multilingual Feedback
Authors:
Shangbin Feng,
Weijia Shi,
Yike Wang,
Wenxuan Ding,
Orevaoghene Ahia,
Shuyue Stella Li,
Vidhisha Balachandran,
Sunayana Sitaram,
Yulia Tsvetkov
Abstract:
Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages. Teaching LLMs to abstain in the face of knowledge gaps is thus a promising strategy to mitigate hallucinations in multilingual settings. However, previous studies on LLM abstention primarily focus on English; we find that directly applying existing solutions beyond English results in…
▽ More
Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages. Teaching LLMs to abstain in the face of knowledge gaps is thus a promising strategy to mitigate hallucinations in multilingual settings. However, previous studies on LLM abstention primarily focus on English; we find that directly applying existing solutions beyond English results in up to 20.5% performance gaps between high and low-resource languages, potentially due to LLMs' drop in calibration and reasoning beyond a few resource-rich languages. To this end, we propose strategies to enhance LLM abstention by learning from multilingual feedback, where LLMs self-reflect on proposed answers in one language by generating multiple feedback items in related languages: we show that this helps identifying the knowledge gaps across diverse languages, cultures, and communities. Extensive experiments demonstrate that our multilingual feedback approach outperforms various strong baselines, achieving up to 9.2% improvement for low-resource languages across three black-box and open models on three datasets, featuring open-book, closed-book, and commonsense QA. Further analysis reveals that multilingual feedback is both an effective and a more equitable abstain strategy to serve diverse language speakers, and cultural factors have great impact on language selection and LLM abstention behavior, highlighting future directions for multilingual and multi-cultural reliable language modeling.
△ Less
Submitted 14 February, 2025; v1 submitted 22 June, 2024;
originally announced June 2024.
-
MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning
Authors:
Shuyue Stella Li,
Vidhisha Balachandran,
Shangbin Feng,
Jonathan S. Ilgen,
Emma Pierson,
Pang Wei Koh,
Yulia Tsvetkov
Abstract:
Users typically engage with LLMs interactively, yet most existing benchmarks evaluate them in a static, single-turn format, posing reliability concerns in interactive scenarios. We identify a key obstacle towards reliability: LLMs are trained to answer any question, even with incomplete context or insufficient knowledge. In this paper, we propose to change the static paradigm to an interactive one…
▽ More
Users typically engage with LLMs interactively, yet most existing benchmarks evaluate them in a static, single-turn format, posing reliability concerns in interactive scenarios. We identify a key obstacle towards reliability: LLMs are trained to answer any question, even with incomplete context or insufficient knowledge. In this paper, we propose to change the static paradigm to an interactive one, develop systems that proactively ask questions to gather more information and respond reliably, and introduce an benchmark - MediQ - to evaluate question-asking ability in LLMs. MediQ simulates clinical interactions consisting of a Patient System and an adaptive Expert System; with potentially incomplete initial information, the Expert refrains from making diagnostic decisions when unconfident, and instead elicits missing details via follow-up questions. We provide a pipeline to convert single-turn medical benchmarks into an interactive format. Our results show that directly prompting state-of-the-art LLMs to ask questions degrades performance, indicating that adapting LLMs to proactive information-seeking settings is nontrivial. We experiment with abstention strategies to better estimate model confidence and decide when to ask questions, improving diagnostic accuracy by 22.3%; however, performance still lags compared to an (unrealistic in practice) upper bound with complete information upfront. Further analyses show improved interactive performance with filtering irrelevant contexts and reformatting conversations. Overall, we introduce a novel problem towards LLM reliability, an interactive MediQ benchmark and a novel question-asking system, and highlight directions to extend LLMs' information-seeking abilities in critical domains.
△ Less
Submitted 7 November, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference…
▽ More
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
△ Less
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
Authors:
Yu Ying Chiu,
Liwei Jiang,
Maria Antoniak,
Chan Young Park,
Shuyue Stella Li,
Mehar Bhatia,
Sahithya Ravi,
Yulia Tsvetkov,
Vered Shwartz,
Yejin Choi
Abstract:
Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdat…
▽ More
Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors
Authors:
Shuyue Stella Li,
Beining Xu,
Xiangyu Zhang,
Hexin Liu,
Wenhan Chao,
Leibny Paola Garcia
Abstract:
In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor for a set…
▽ More
In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor for a set of topologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and synthetic information in the extracted representations using deep generalized canonical correlation analysis. Results show the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. There is a positive correlation between PSR scores and ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.
△ Less
Submitted 27 November, 2023;
originally announced November 2023.
-
Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles
Authors:
Weiting Tan,
Haoran Xu,
Lingfeng Shen,
Shuyue Stella Li,
Kenton Murray,
Philipp Koehn,
Benjamin Van Durme,
Yunmo Chen
Abstract:
Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, there remains a discernible gap comparing their performance with the few-shot setting. In this paper, we investigate the factors contributing…
▽ More
Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, there remains a discernible gap comparing their performance with the few-shot setting. In this paper, we investigate the factors contributing to this gap and find that this gap can largely be closed (for about 70%) by matching the writing styles of the target corpus. Additionally, we explore potential approaches to enhance zero-shot baselines without the need for parallel demonstration examples, providing valuable insights into how these methods contribute to improving translation metrics.
△ Less
Submitted 3 November, 2023;
originally announced November 2023.
-
Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning
Authors:
Shuyue Stella Li,
Cihan Xiao,
Tianjian Li,
Bismarck Odoom
Abstract:
Code-switching, also called code-mixing, is the linguistics phenomenon where in casual settings, multilingual speakers mix words from different languages in one utterance. Due to its spontaneous nature, code-switching is extremely low-resource, which makes it a challenging problem for language and speech processing tasks. In such contexts, Code-Switching Language Identification (CSLID) becomes a d…
▽ More
Code-switching, also called code-mixing, is the linguistics phenomenon where in casual settings, multilingual speakers mix words from different languages in one utterance. Due to its spontaneous nature, code-switching is extremely low-resource, which makes it a challenging problem for language and speech processing tasks. In such contexts, Code-Switching Language Identification (CSLID) becomes a difficult but necessary task if we want to maximally leverage existing monolingual tools for other tasks. In this work, we propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset. Our methods include a stacked Residual CNN+GRU model and a multitask pre-training approach to use Automatic Speech Recognition (ASR) as an auxiliary task for CSLID. Due to the low-resource nature of code-switching, we also employ careful silver data creation using monolingual corpora in both languages and up-sampling as data augmentation. We focus on English-Mandarin code-switched data, but our method works on any language pair. Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
△ Less
Submitted 31 May, 2023;
originally announced May 2023.
-
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules
Authors:
Haoran Xu,
Weiting Tan,
Shuyue Stella Li,
Yunmo Chen,
Benjamin Van Durme,
Philipp Koehn,
Kenton Murray
Abstract:
Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fu…
▽ More
Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices to approximate the full-rank matrix. Furthermore, we condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS is able to have comparable translation performance with much fewer parameters.
△ Less
Submitted 22 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns
Authors:
Shuyue Stella Li,
Kenton Murray
Abstract:
In this work, we focus on intrasentential code-mixing and propose several different Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks across various amounts of labeled gold data. Most importantly, our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask signif…
▽ More
In this work, we focus on intrasentential code-mixing and propose several different Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks across various amounts of labeled gold data. Most importantly, our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy, motivating further linguistic insights into the phenomenon of code-mixing. We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching up to a relative improvement of 7.73% on the extremely scarce English-Malayalam dataset. We conclude that the code-switch pattern in code-mixing sentences is also important for the model to learn. Finally, we propose a language-agnostic SCM algorithm that is cheap yet extremely helpful for low-resource languages.
△ Less
Submitted 14 November, 2022;
originally announced November 2022.
-
PQLM -- Multilingual Decentralized Portable Quantum Language Model for Privacy Protection
Authors:
Shuyue Stella Li,
Xiangyu Zhang,
Shu Zhou,
Hongchao Shu,
Ruixing Liang,
Hexin Liu,
Leibny Paola Garcia
Abstract:
With careful manipulation, malicious agents can reverse engineer private information encoded in pre-trained language models. Security concerns motivate the development of quantum pre-training. In this work, we propose a highly Portable Quantum Language Model (PQLM) that can easily transmit information to downstream tasks on classical machines. The framework consists of a cloud PQLM built with rand…
▽ More
With careful manipulation, malicious agents can reverse engineer private information encoded in pre-trained language models. Security concerns motivate the development of quantum pre-training. In this work, we propose a highly Portable Quantum Language Model (PQLM) that can easily transmit information to downstream tasks on classical machines. The framework consists of a cloud PQLM built with random Variational Quantum Classifiers (VQC) and local models for downstream applications. We demonstrate the ad hoc portability of the quantum model by extracting only the word embeddings and effectively applying them to downstream tasks on classical machines. Our PQLM exhibits comparable performance to its classical counterpart on both intrinsic evaluation (loss, perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy) metrics. We also perform ablation studies on the factors affecting PQLM performance to analyze model stability. Our work establishes a theoretical foundation for a portable quantum pre-trained language model that could be trained on private data and made available for public use with privacy protection guarantees.
△ Less
Submitted 26 February, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
End-to-End Lyrics Recognition with Self-supervised Learning
Authors:
Xiangyu Zhang,
Shuyue Stella Li,
Zhanhong He,
Roberto Togneri,
Leibny Paola Garcia
Abstract:
Lyrics recognition is an important task in music processing. Despite traditional algorithms such as the hybrid HMM- TDNN model achieving good performance, studies on applying end-to-end models and self-supervised learning (SSL) are limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models on lyrics recognition task. We e…
▽ More
Lyrics recognition is an important task in music processing. Despite traditional algorithms such as the hybrid HMM- TDNN model achieving good performance, studies on applying end-to-end models and self-supervised learning (SSL) are limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models on lyrics recognition task. We evaluate a variety of upstream SSL models with different training methods (masked reconstruction, masked prediction, autoregressive reconstruction, and contrastive learning). Our end-to-end self-supervised models, evaluated on the DAMP music dataset, outperform the previous state-of-the-art (SOTA) system by 5.23% for the dev set and 2.4% for the test set even without a language model trained by a large corpus. Moreover, we investigate the effect of background music on the performance of self-supervised learning models and conclude that the SSL models cannot extract features efficiently in the presence of background music. Finally, we study the out-of-domain generalization ability of the SSL features considering that those models were not trained on music datasets.
△ Less
Submitted 26 October, 2022; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Genetic Improvement in the Shackleton Framework for Optimizing LLVM Pass Sequences
Authors:
Shuyue Stella Li,
Hannah Peeler,
Andrew N. Sloss,
Kenneth N. Reid,
Wolfgang Banzhaf
Abstract:
Genetic improvement is a search technique that aims to improve a given acceptable solution to a problem. In this paper, we present the novel use of genetic improvement to find problem-specific optimized LLVM pass sequences. We develop a pass-level patch representation in the linear genetic programming framework, Shackleton, to evolve the modifications to be applied to the default optimization pass…
▽ More
Genetic improvement is a search technique that aims to improve a given acceptable solution to a problem. In this paper, we present the novel use of genetic improvement to find problem-specific optimized LLVM pass sequences. We develop a pass-level patch representation in the linear genetic programming framework, Shackleton, to evolve the modifications to be applied to the default optimization pass sequences. Our GI-evolved solution has a mean of 3.7% runtime improvement compared to the -O3 optimization level in the default code generation options which optimizes on runtime. The proposed GI method provides an automatic way to find a problem-specific optimization sequence that improves upon a general solution without any expert domain knowledge. In this paper, we discuss the advantages and limitations of the GI feature in the Shackleton Framework and present our results.
△ Less
Submitted 27 April, 2022;
originally announced April 2022.
-
Optimizing LLVM Pass Sequences with Shackleton: A Linear Genetic Programming Framework
Authors:
Hannah Peeler,
Shuyue Stella Li,
Andrew N. Sloss,
Kenneth N. Reid,
Yuan Yuan,
Wolfgang Banzhaf
Abstract:
In this paper we introduce Shackleton as a generalized framework enabling the application of linear genetic programming -- a technique under the umbrella of evolutionary algorithms -- to a variety of use cases. We also explore here a novel application for this class of methods: optimizing sequences of LLVM optimization passes. The algorithm underpinning Shackleton is discussed, with an emphasis on…
▽ More
In this paper we introduce Shackleton as a generalized framework enabling the application of linear genetic programming -- a technique under the umbrella of evolutionary algorithms -- to a variety of use cases. We also explore here a novel application for this class of methods: optimizing sequences of LLVM optimization passes. The algorithm underpinning Shackleton is discussed, with an emphasis on the effects of different features unique to the framework when applied to LLVM pass sequences. Combined with analysis of different hyperparameter settings, we report the results on automatically optimizing pass sequences using Shackleton for two software applications at differing complexity levels. Finally, we reflect on the advantages and limitations of our current implementation and lay out a path for further improvements. These improvements aim to surpass hand-crafted solutions with an automatic discovery method for an optimal pass sequence.
△ Less
Submitted 31 January, 2022;
originally announced January 2022.