Search | arXiv e-print repository

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud

Abstract: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA <--> HIP) and assembly-level (Nvidia SASS <--> AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family o… ▽ More We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA <--> HIP) and assembly-level (Nvidia SASS <--> AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. △ Less

Submitted 29 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: 20 pages, 11 figures, 5 tables

arXiv:2505.14946 [pdf, ps, other]

Reinforcement Learning from User Feedback

Authors: Eric Han, Jun Chen, Karthik Abinav Sankararaman, Xiaoliang Peng, Tengyu Xu, Eryk Helenowski, Kaiyan Peng, Mrinal Kumar, Sinong Wang, Han Fang, Arya Talebzadeh

Abstract: As large language models (LLMs) are increasingly deployed in diverse user facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from… ▽ More As large language models (LLMs) are increasingly deployed in diverse user facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: user feedback is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale. △ Less

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.00304 [pdf, other]

Reinforcement Learning with Continuous Actions Under Unmeasured Confounding

Authors: Yuhan Li, Eugene Han, Yifan Hu, Wenzhuo Zhou, Zhengling Qi, Yifan Cui, Ruoqing Zhu

Abstract: This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we advance this field by establishing a novel identification result to enable the no… ▽ More This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we advance this field by establishing a novel identification result to enable the nonparametric estimation of policy value for a given target policy under an infinite-horizon framework. Leveraging this identification, we develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy that maximizes the estimated policy value. Furthermore, we provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy. Extensive simulations and a real-world application using the German Family Panel data demonstrate the effectiveness of our proposed methodology. △ Less

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.13392 [pdf, ps, other]

POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation

Authors: Evans Xu Han, Alice Qian Zhang, Hong Shen, Haiyi Zhu, Paul Pu Liang, Jane Hsieh

Abstract: State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for… ▽ More State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET's ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work. △ Less

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2411.07705 [pdf, other]

dpvis: A Visual and Interactive Learning Tool for Dynamic Programming

Authors: David H. Lee, Aditya Prasad, Ramiro Deo-Campo Vuong, Tianyu Wang, Eric Han, David Kempe

Abstract: Dynamic programming (DP) is a fundamental and powerful algorithmic paradigm taught in most undergraduate (and many graduate) algorithms classes. DP problems are challenging for many computer science students because they require identifying unique problem structures and a refined understanding of recursion. In this paper, we present dpvis, a Python library that helps students understand DP through… ▽ More Dynamic programming (DP) is a fundamental and powerful algorithmic paradigm taught in most undergraduate (and many graduate) algorithms classes. DP problems are challenging for many computer science students because they require identifying unique problem structures and a refined understanding of recursion. In this paper, we present dpvis, a Python library that helps students understand DP through a frame-by-frame animation of dynamic programs. dpvis can easily generate animations of dynamic programs with as little as two lines of modifications compared to a standard Python implementation. For each frame, dpvis highlight the cells that have been read from and written to during an iteration. Moreover, dpvis allows users to test their understanding by prompting them with questions about the next operation performed by the algorithm. We deployed dpvis as a learning tool in an undergraduate algorithms class, and report on the results of a survey. The survey results suggest that dpvis is especially helpful for visualizing the recursive structure of DP. Although some students struggled with the installation of the tool (which has been simplified since the reported deployment), essentially all other students found the tool to be useful for understanding dynamic programs. dpvis is available at https://github.com/itsdawei/dpvis. △ Less

Submitted 12 November, 2024; originally announced November 2024.

Comments: Published as a conference paper at Technical Symposium on Computer Science Education (SIGCSE TS '25); dpvis is available at https://dpvis.readthedocs.io/en/latest/

ACM Class: K.3.1; K.3.2

arXiv:2410.16719 [pdf, other]

Progressive Compositionality in Text-to-Image Generative Models

Authors: Evans Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang

Abstract: Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate hig… ▽ More Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios and harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair, consisting of 15k pairs of high-quality contrastive images. These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose EvoGen, a new multi-stage curriculum for contrastive learning of diffusion models. Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks. △ Less

Submitted 26 April, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

arXiv:2409.20370 [pdf, other]

The Perfect Blend: Redefining RLHF with Mixture of Judges

Authors: Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shaoliang Nie, Chen Zhu, Hejia Zhang, Wenxuan Zhou, Zhouhao Zeng, Yun He, Karishma Mandyam, Arya Talabzadeh, Madian Khabsa, Gabriel Cohen, Yuandong Tian, Hao Ma, Sinong Wang, Han Fang

Abstract: Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM). However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., trade-off of multiple and/or sometimes conflicting objectives). Applying RLHF for MTL currently requires careful tuning of the wei… ▽ More Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM). However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., trade-off of multiple and/or sometimes conflicting objectives). Applying RLHF for MTL currently requires careful tuning of the weights for reward model and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we called Constrained Generative Policy Optimization (CGPO). The core of CGPO is Mixture of Judges (MoJ) with cost-efficient constrained policy optimization with stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require extensive hyper-parameter tuning, and is plug-and-play in common post-training pipelines. Together, this can detect and mitigate reward hacking behaviors while reaching a pareto-optimal point across an extremely large number of objectives. Our empirical evaluations demonstrate that CGPO significantly outperforms standard RLHF algorithms like PPO and DPO across various tasks including general chat, STEM questions, instruction following, and coding. Specifically, CGPO shows improvements of 7.4% in AlpacaEval-2 (general chat), 12.5% in Arena-Hard (STEM & reasoning), and consistent gains in other domains like math and coding. Notably, PPO, while commonly used, is prone to severe reward hacking in popular coding benchmarks, which CGPO successfully addresses. This breakthrough in RLHF not only tackles reward hacking and extreme multi-objective optimization challenges but also advances the state-of-the-art in aligning general-purpose LLMs for diverse applications. △ Less

Submitted 30 September, 2024; originally announced September 2024.

Comments: submitted to conference

arXiv:2407.18380 [pdf, other]

doi 10.1109/WoWMoM60985.2024.00023

Effect of Duration and Delay on the Identifiability of VR Motion

Authors: Mark Roman Miller, Vivek Nair, Eugy Han, Cyan DeVeaux, Christian Rack, Rui Wang, Brandon Huang, Marc Erich Latoschik, James F. O'Brien, Jeremy N. Bailenson

Abstract: Social virtual reality is an emerging medium of communication. In this medium, a user's avatar (virtual representation) is controlled by the tracked motion of the user's headset and hand controllers. This tracked motion is a rich data stream that can leak characteristics of the user or can be effectively matched to previously-identified data to identify a user. To better understand the boundaries… ▽ More Social virtual reality is an emerging medium of communication. In this medium, a user's avatar (virtual representation) is controlled by the tracked motion of the user's headset and hand controllers. This tracked motion is a rich data stream that can leak characteristics of the user or can be effectively matched to previously-identified data to identify a user. To better understand the boundaries of motion data identifiability, we investigate how varying training data duration and train-test delay affects the accuracy at which a machine learning model can correctly classify user motion in a supervised learning task simulating re-identification. The dataset we use has a unique combination of a large number of participants, long duration per session, large number of sessions, and a long time span over which sessions were conducted. We find that training data duration and train-test delay affect identifiability; that minimal train-test delay leads to very high accuracy; and that train-test delay should be controlled in future experiments. △ Less

Submitted 26 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

Comments: 6 pages, 2 figures, presented at the SePAR workshop (Security and Privacy in Mixed, Augmented, and Virtual Realities), co-located with WoWMoM 2024. arXiv admin note: text overlap with arXiv:2303.01430

arXiv:2407.02896 [pdf, other]

Predicting and Understanding Turn-Taking Behavior in Open-Ended Group Activities in Virtual Reality

Authors: Portia Wang, Eugy Han, Anna C. M. Queiroz, Cyan DeVeaux, Jeremy N. Bailenson

Abstract: In networked virtual reality (VR), user behaviors, individual differences, and group dynamics can serve as important signals into future speech behaviors, such as who the next speaker will be and the timing of turn-taking behaviors. The ability to predict and understand these behaviors offers opportunities to provide adaptive and personalized assistance, for example helping users with varying sens… ▽ More In networked virtual reality (VR), user behaviors, individual differences, and group dynamics can serve as important signals into future speech behaviors, such as who the next speaker will be and the timing of turn-taking behaviors. The ability to predict and understand these behaviors offers opportunities to provide adaptive and personalized assistance, for example helping users with varying sensory abilities navigate complex social scenes and instantiating virtual moderators with natural behaviors. In this work, we predict turn-taking behaviors using features extracted based on social dynamics literature. We discuss results from a large-scale VR classroom dataset consisting of 77 sessions and 1660 minutes of small-group social interactions collected over four weeks. In our evaluation, gradient boosting classifiers achieved the best performance, with accuracies of 0.71--0.78 AUC (area under the ROC curve) across three tasks concerning the "what", "who", and "when" of turn-taking behaviors. In interpreting these models, we found that group size, listener personality, speech-related behavior (e.g., time elapsed since the listener's last speech event), group gaze (e.g., how much the group looks at the speaker), as well as the listener's and previous speaker's head pitch, head y-axis position, and left hand y-axis position more saliently influenced predictions. Results suggested that these features remain reliable indicators in novel social VR settings, as prediction performance is robust over time and with groups and activities not used in the training dataset. We discuss theoretical and practical implications of the work. △ Less

Submitted 25 April, 2025; v1 submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.06516 [pdf, other]

Distribution-Free Predictive Inference under Unknown Temporal Drift

Authors: Elise Han, Chengpiao Huang, Kaizheng Wang

Abstract: Distribution-free prediction sets play a pivotal role in uncertainty quantification for complex statistical models. Their validity hinges on reliable calibration data, which may not be readily available as real-world environments often undergo unknown changes over time. In this paper, we propose a strategy for choosing an adaptive window and use the data therein to construct prediction sets. The w… ▽ More Distribution-free prediction sets play a pivotal role in uncertainty quantification for complex statistical models. Their validity hinges on reliable calibration data, which may not be readily available as real-world environments often undergo unknown changes over time. In this paper, we propose a strategy for choosing an adaptive window and use the data therein to construct prediction sets. The window is selected by optimizing an estimated bias-variance tradeoff. We provide sharp coverage guarantees for our method, showing its adaptivity to the underlying temporal drift. We also illustrate its efficacy through numerical experiments on synthetic and real data. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 25 pages, 4 figures, 6 tables

arXiv:2402.08672 [pdf, other]

Model Assessment and Selection under Temporal Distribution Shift

Authors: Elise Han, Chengpiao Huang, Kaizheng Wang

Abstract: We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidat… ▽ More We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidate models by estimating the difference of their generalization errors. We further integrate pairwise comparisons into a single-elimination tournament, achieving near-optimal model selection from a collection of candidates. Theoretical analyses and numerical experiments demonstrate the adaptivity of our proposed methods to the non-stationarity in data. △ Less

Submitted 3 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 26 pages, 6 figures, 4 tables

MSC Class: 62G05 (Primary); 62J02 (Secondary)

arXiv:2307.05435 [pdf, other]

One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data

Authors: Michal Golovanevsky, Eva Schiller, Akira Nair, Eric Han, Ritambhara Singh, Carsten Eickhoff

Abstract: Multimodal learning models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question-answering to autonomous driving. Despite the importance of multimodal learning, existing efforts focus on NLP applications, where the number of modalities is typically less than four (audio, video, text, images). However, data inputs in other domains, such… ▽ More Multimodal learning models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question-answering to autonomous driving. Despite the importance of multimodal learning, existing efforts focus on NLP applications, where the number of modalities is typically less than four (audio, video, text, images). However, data inputs in other domains, such as the medical field, may include X-rays, PET scans, MRIs, genetic screening, clinical notes, and more, creating a need for both efficient and accurate information fusion. Many state-of-the-art models rely on pairwise cross-modal attention, which does not scale well for applications with more than three modalities. For $n$ modalities, computing attention will result in $n \choose 2$ operations, potentially requiring considerable amounts of computational resources. To address this, we propose a new domain-neutral attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities and requires only $n$ attention operations, thus offering a significant reduction in computational complexity compared to existing cross-modal attention algorithms. Using three diverse real-world datasets as well as an additional simulation experiment, we show that our method improves performance compared to popular fusion techniques while decreasing computation costs. △ Less

Submitted 21 October, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:2303.01430 [pdf, other]

A Large-Scale Study of Personal Identifiability of Virtual Reality Motion Over Time

Authors: Mark Roman Miller, Eugy Han, Cyan DeVeaux, Eliot Jones, Ryan Chen, Jeremy N. Bailenson

Abstract: In recent years, social virtual reality (VR), sometimes described as the "metaverse," has become widely available. With its potential comes risks, including risks to privacy. To understand these risks, we study the identifiability of participants' motion in VR in a dataset of 232 VR users with eight weekly sessions of about thirty minutes each, totaling 764 hours of social interaction. The sample… ▽ More In recent years, social virtual reality (VR), sometimes described as the "metaverse," has become widely available. With its potential comes risks, including risks to privacy. To understand these risks, we study the identifiability of participants' motion in VR in a dataset of 232 VR users with eight weekly sessions of about thirty minutes each, totaling 764 hours of social interaction. The sample is unique as we are able to study the effect of user, session, and time independently. We find that the number of sessions recorded greatly increases identifiability, and duration per session increases identifiability as well, but to a lesser degree. We also find that greater delay between training and testing sessions reduces identifiability. Ultimately, understanding the identifiability of VR activities will help designers, security professionals, and consumer advocates make VR safer. △ Less

Submitted 2 March, 2023; originally announced March 2023.

Comments: 15 pages, 5 figures

arXiv:2207.05261 [pdf, other]

Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique

Authors: Changnam An, Eunkyung Han, Dongmyeong Noh, Ohkyoon Kwon, Sumi Lee, Hyunshim Han

Abstract: We present an efficient framework of corpus for sign language translation. Aided with a simple but dramatic data augmentation technique, our method converts text into annotated forms with minimum information loss. Sign languages are composed of manual signals, non-manual signals, and iconic features. According to professional sign language interpreters, non-manual signals such as facial expression… ▽ More We present an efficient framework of corpus for sign language translation. Aided with a simple but dramatic data augmentation technique, our method converts text into annotated forms with minimum information loss. Sign languages are composed of manual signals, non-manual signals, and iconic features. According to professional sign language interpreters, non-manual signals such as facial expressions and gestures play an important role in conveying exact meaning. By considering the linguistic features of sign language, our proposed framework is a first and unique attempt to build a multimodal sign language augmentation corpus (hereinafter referred to as the KoSLA corpus) containing both manual and non-manual modalities. The corpus we built demonstrates confident results in the hospital context, showing improved performance with augmented datasets. To overcome data scarcity, we resorted to data augmentation techniques such as synonym replacement to boost the efficiency of our translation model and available data, while maintaining grammatical and semantic structures of sign language. For the experimental support, we verify the effectiveness of data augmentation technique and usefulness of our corpus by performing a translation task between normal sentences and sign language annotations on two tokenizers. The result was convincing, proving that the BLEU scores with the KoSLA corpus were significant. △ Less

Submitted 11 July, 2022; originally announced July 2022.

arXiv:2202.11323 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747384

Improving fairness in speaker verification via Group-adapted Fusion Network

Authors: Hua Shen, Yuguang Yang, Guoli Sun, Ryan Langman, Eunjung Han, Jasha Droppo, Andreas Stolcke

Abstract: Modern speaker verification models use deep neural networks to encode utterance audio into discriminative embedding vectors. During the training process, these networks are typically optimized to differentiate arbitrary speakers. This learning process biases the learning of fine voice characteristics towards dominant demographic groups, which can lead to an unfair performance disparity across diff… ▽ More Modern speaker verification models use deep neural networks to encode utterance audio into discriminative embedding vectors. During the training process, these networks are typically optimized to differentiate arbitrary speakers. This learning process biases the learning of fine voice characteristics towards dominant demographic groups, which can lead to an unfair performance disparity across different groups. This is observed especially with underrepresented demographic groups sharing similar voice characteristics. In this work, we investigate the fairness of speaker verification models on controlled datasets with imbalanced gender distributions, providing direct evidence that model performance suffers for underrepresented groups. To mitigate this disparity we propose the group-adapted fusion network (GFN) architecture, a modular architecture based on group embedding adaptation and score fusion. We show that our method alleviates model unfairness by improving speaker verification both overall and for individual groups. Given imbalanced group representation in training, our proposed method achieves overall equal error rate (EER) reduction of 9.6% to 29.0% relative, reduces minority group EER by 13.7% to 18.6%, and results in 20.0% to 25.4% less EER disparity, compared to baselines. The approach is applicable to other types of training data skew in speaker recognition systems. △ Less

Submitted 23 February, 2022; originally announced February 2022.

Comments: To appear in Proc. IEEE ICASSP 2022

Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7077-7081

arXiv:2202.10672 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746411

Contrastive-mixup learning for improved speaker verification

Authors: Xin Zhang, Minho Jin, Roger Cheng, Ruirui Li, Eunjung Han, Andreas Stolcke

Abstract: This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Althou… ▽ More This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered around closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we have reformulated the prototypical loss function such that mixup is enabled on metric learning objectives. To demonstrate its generalization given limited training data, we conduct experiments by varying the number of available utterances from each speaker in the VoxCeleb database. Experimental results show that applying contrastive-mixup outperforms the existing baseline, reducing error rate by 16% relatively, especially when the number of training utterances per speaker is limited. △ Less

Submitted 22 February, 2022; originally announced February 2022.

Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7652-7656

arXiv:2202.01286 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746964

ASR-Aware End-to-end Neural Diarization

Authors: Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

Abstract: We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pret… ▽ More We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline. △ Less

Submitted 2 February, 2022; originally announced February 2022.

Comments: To appear in ICASSP 2022

Journal ref: Proc. IEEE ICASSP, May 2022, pp. 8092-8096

arXiv:2110.08449 [pdf, other]

Adversarial Attacks on Gaussian Process Bandits

Authors: Eric Han, Jonathan Scarlett

Abstract: Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing… ▽ More Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from theoretical and practical perspectives. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some target region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we test our attacks' effectiveness on a diverse range of objective functions. △ Less

Submitted 16 June, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: Accepted to ICML 2022

MSC Class: 90C26 ACM Class: G.1.6

arXiv:2109.02576 [pdf, other]

doi 10.1109/ASRU51503.2021.9687975

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

Authors: Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

Abstract: Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same ho… ▽ More Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same household, we propose a household-adapted nonlinear mapping to a low dimensional space to complement the global scoring metric. The combined scoring function is optimized on labeled or pseudo-labeled speaker utterances. With input dropout, the proposed scoring model reduces EER by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers per household. On real-world internal data, the EER reduction is 49.2%. From t-SNE visualization, we also show that clusters formed by household-adapted speaker embeddings are more compact and uniformly distributed, compared to clusters formed by global embeddings before adaptation. △ Less

Submitted 6 September, 2021; originally announced September 2021.

Comments: Submitted to ASRU 2021

Journal ref: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2021, pp. 1124-1131

arXiv:2106.07167 [pdf, other]

doi 10.21437/Interspeech.2021-1909

End-to-end Neural Diarization: From Transformer to Conformer

Authors: Yi Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke

Abstract: We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conforme… ▽ More We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conformer gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does not generalize as well from simulated to real conversation data as the Transformer-based model. This leads us to quantify the mismatch between simulated data and real speaker behavior in terms of temporal statistics reflecting turn-taking between speakers, and investigate its correlation with diarization error. By mixing simulated and real data in EEND training, we mitigate the mismatch further, with Conformer-based EEND achieving 24% error reduction over the baseline SA-EEND system, and 10% improvement over the best augmented Transformer-based system, on two-speaker CALLHOME data. △ Less

Submitted 14 June, 2021; originally announced June 2021.

Comments: To appear in Interspeech 2021

Journal ref: Proc. Interspeech, Sept. 2021, pp. 3081-3085

arXiv:2012.13088 [pdf, other]

High-Dimensional Bayesian Optimization via Tree-Structured Additive Models

Authors: Eric Han, Ishank Arora, Jonathan Scarlett

Abstract: Bayesian Optimization (BO) has shown significant success in tackling expensive low-dimensional black-box optimization problems. Many optimization problems of interest are high-dimensional, and scaling BO to such settings remains an important challenge. In this paper, we consider generalized additive models in which low-dimensional functions with overlapping subsets of variables are composed to mod… ▽ More Bayesian Optimization (BO) has shown significant success in tackling expensive low-dimensional black-box optimization problems. Many optimization problems of interest are high-dimensional, and scaling BO to such settings remains an important challenge. In this paper, we consider generalized additive models in which low-dimensional functions with overlapping subsets of variables are composed to model a high-dimensional target function. Our goal is to lower the computational resources required and facilitate faster model learning by reducing the model complexity while retaining the sample-efficiency of existing methods. Specifically, we constrain the underlying dependency graphs to tree structures in order to facilitate both the structure learning and optimization of the acquisition function. For the former, we propose a hybrid graph learning algorithm based on Gibbs sampling and mutation. In addition, we propose a novel zooming-based algorithm that permits generalized additive models to be employed more efficiently in the case of continuous domains. We demonstrate and discuss the efficacy of our approach via a range of experiments on synthetic functions and real-world datasets. △ Less

Submitted 23 December, 2020; originally announced December 2020.

Comments: To appear in AAAI 2021

MSC Class: 90C26 ACM Class: G.1.6

Journal ref: Vol. 35 No. 9: AAAI-21 Technical Tracks 9 (2021) 7630-7638

arXiv:2011.02678 [pdf, other]

doi 10.1109/ICASSP39728.2021.9414371

BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

Authors: Eunjung Han, Chul Lee, Andreas Stolcke

Abstract: We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information… ▽ More We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, we show accuracy comparable to the offline clustering-based system. △ Less

Submitted 12 February, 2021; v1 submitted 5 November, 2020; originally announced November 2020.

Journal ref: Proc. IEEE ICASSP, June 2021, pp. 7193-7197

Showing 1–22 of 22 results for author: Han, E