Showing 1–47 of 47 results for author: Anumanchipalli, G

  1. arXiv:2503.09572  [pdf, other]

    cs.CL

    Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

    Authors: Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

    Abstract: Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution det…

    Submitted 22 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  2. arXiv:2503.04721  [pdf, other]

    cs.CL eess.AS

    Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

    Authors: Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee

    Abstract: Spoken dialogue modeling introduces unique challenges beyond text-based language modeling, demanding robust turn-taking, backchanneling, and real-time interaction. Although most Spoken Dialogue Models (SDMs) rely on half-duplex processing (handling speech one turn at a time), emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural and engaging conversations. However, c…

    Submitted 6 March, 2025; originally announced March 2025.

  3. arXiv:2502.19416  [pdf, other]

    cs.CL cs.AI

    Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing

    Authors: Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli

    Abstract: This study investigates the impact of localized updates to large language models (LLMs), specifically in the context of knowledge editing - a task aimed at incorporating or modifying specific facts without altering broader model capabilities. We first show that across different post-training interventions like continuous pre-training, full fine-tuning and LoRA-based fine-tuning, the Frobenius norm…

    Submitted 26 February, 2025; originally announced February 2025.

    Comments: Accepted for Oral Presentation at KnowFM @ AAAI 2025. arXiv admin note: text overlap with arXiv:2502.01636

  4. arXiv:2502.01636  [pdf, other]

    cs.CL cs.AI cs.LG

    Lifelong Sequential Knowledge Editing without Model Degradation

    Authors: Akshat Gupta, Phudish Prateepamornkul, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli

    Abstract: Prior work in parameter-modifying knowledge editing has shown that large-scale sequential editing leads to significant model degradation. In this paper, we study the reasons behind this and scale sequential knowledge editing to 10,000 sequential edits, while maintaining the downstream performance of the original model. We first show that locate-then-edit knowledge editing methods lead to overfitti…

    Submitted 3 February, 2025; originally announced February 2025.

  5. arXiv:2501.12385  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Texture Manipulation by Exemplar-Based Analogy

    Authors: Kan Jen Cheng, Tingle Li, Gopala Anumanchipalli

    Abstract: Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the origi…

    Submitted 21 January, 2025; originally announced January 2025.

    Comments: ICASSP 2025

  6. arXiv:2501.08328  [pdf, other]

    cs.CL cs.AI cs.GT

    PokerBench: Training Large Language Models to become Professional Poker Players

    Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli

    Abstract: We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and h…

    Submitted 24 January, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

    Comments: AAAI 2025

  7. arXiv:2412.13387  [pdf, other]

    eess.AS cs.SD

    Deep Speech Synthesis from Multimodal Articulatory Representations

    Authors: Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli

    Abstract: The amount of articulatory data available for training deep learning models is much less compared to acoustic speech data. In order to improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intell…

    Submitted 17 December, 2024; originally announced December 2024.

  8. arXiv:2410.07168  [pdf, other]

    cs.CL cs.SD eess.AS

    Sylber: Syllabic Embedding Representation of Speech from Raw Audio

    Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli

    Abstract: Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we p…

    Submitted 2 March, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: Accepted at ICLR 2025

  9. arXiv:2409.17141  [pdf, other]

    cs.CL cs.AI cs.LG

    FineZip : Pushing the Limits of Large Language Models for Practical Lossless Text Compression

    Authors: Fazal Mittu, Yihuan Bu, Akshat Gupta, Ashok Devireddy, Alp Eren Ozdarendeli, Anant Singh, Gopala Anumanchipalli

    Abstract: While the language modeling objective has been shown to be deeply connected with compression, it is surprising that modern LLMs are not employed in practical text compression systems. In this paper, we provide an in-depth analysis of neural network and transformer-based compression techniques to answer this question. We compare traditional text compression systems with neural network and LLM-based…

    Submitted 25 September, 2024; originally announced September 2024.

  10. arXiv:2409.14340  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Self-Supervised Audio-Visual Soundscape Stylization

    Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

    Abstract: Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact…

    Submitted 22 September, 2024; originally announced September 2024.

    Comments: ECCV 2024

  11. arXiv:2409.13582  [pdf, other]

    eess.AS cs.AI cs.SD

    Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

    Authors: Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli

    Abstract: Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) probl…

    Submitted 20 September, 2024; originally announced September 2024.

  12. arXiv:2409.12951  [pdf, other]

    cs.LG cs.AI cs.CL

    Geometric Interpretation of Layer Normalization and a Comparative Analysis with RMSNorm

    Authors: Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli

    Abstract: This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. With these geometric insights, we prepare the foundation for comparing LayerNorm with RMSNorm. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as…

    Submitted 1 February, 2025; v1 submitted 19 September, 2024; originally announced September 2024.

  13. arXiv:2409.09621  [pdf, other]

    eess.AS cs.AI cs.SD

    Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

    Authors: Xuanru Zhou, Cheol Jun Cho, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Boon Lead Tee, Maria Luisa Gorno Tempini, Jiachen Lian, Gopala Anumanchipalli

    Abstract: Current de-facto dysfluency modeling methods utilize template matching algorithms which are not generalizable to out-of-domain real-world dysfluencies across languages, and are not scalable with increasing amounts of training data. To handle these problems, we propose Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO obj…

    Submitted 15 September, 2024; originally announced September 2024.

    Comments: IEEE Spoken Language Technology Workshop 2024

  14. arXiv:2409.02451  [pdf, other]

    eess.AS cs.AI cs.SD

    Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

    Authors: Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli

    Abstract: Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance th…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: accepted for Spoken Language Technology Workshop 2024

  15. arXiv:2409.00608  [pdf, other]

    cs.CL cs.LG

    TinyAgent: Function Calling at the Edge

    Authors: Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

    Abstract: Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present…

    Submitted 24 October, 2024; v1 submitted 1 September, 2024; originally announced September 2024.

    Comments: EMNLP 2024 Demo

  16. arXiv:2408.16221  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    SSDM: Scalable Speech Dysfluency Modeling

    Authors: Jiachen Lian, Xuanru Zhou, Zoe Ezzes, Jet Vonk, Brittany Morin, David Baquirin, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Krishna Anumanchipalli

    Abstract: Speech dysfluency modeling is the core module for spoken language learning and speech therapy. However, there are three challenges. First, current state-of-the-art solutions (UDM and H-UDM) suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is no effective learning framework. In this pap…

    Submitted 3 October, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 2024 NeurIPS

  17. arXiv:2408.15297  [pdf, other]

    eess.AS cs.AI cs.CL

    YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection

    Authors: Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Jiachen Lian, Gopala Krishna Anumanchipalli

    Abstract: Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose YOLO-Stutter: a first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect spe…

    Submitted 15 September, 2024; v1 submitted 27 August, 2024; originally announced August 2024.

    Comments: Interspeech 2024

  18. arXiv:2407.07235  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology

    Authors: Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli

    Abstract: As experts in voice modification, trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity. To demonstrate this, we present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes. The VVD illustrates that current approaches in speaker modeling, based on categorical…

    Submitted 9 July, 2024; originally announced July 2024.

  19. arXiv:2406.15754  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Multimodal Segmentation for Vocal Tract Modeling

    Authors: Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

    Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  20. arXiv:2406.12998  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Coding Speech through Vocal Tract Kinematics

    Authors: Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

    Abstract: Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- Speech Articulatory Coding (SPARC). SPARC co…

    Submitted 14 December, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 18, no. 8, pp. 1427-1440, Dec. 2024

  21. arXiv:2405.00664  [pdf, other]

    cs.CL cs.AI cs.LG

    Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3

    Authors: Junsang Yoon, Akshat Gupta, Gopala Anumanchipalli

    Abstract: This study presents a targeted model editing analysis focused on the latest large language model, Llama-3. We explore the efficacy of popular model editing techniques - ROME, MEMIT, and EMMET, which are designed for precise layer interventions. We identify the most effective layers for targeted edits through an evaluation that encompasses up to 4096 edits across three distinct strategies: sequenti…

    Submitted 1 May, 2024; originally announced May 2024.

  22. arXiv:2403.15042  [pdf, other]

    cs.CL

    LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

    Authors: Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

    Abstract: Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation st…

    Submitted 13 July, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: ACL 2024

  23. arXiv:2403.14236  [pdf, other]

    cs.LG cs.AI cs.CL

    A Unified Framework for Model Editing

    Authors: Akshat Gupta, Dev Sajnani, Gopala Anumanchipalli

    Abstract: ROME and MEMIT are largely believed to be two different model editing algorithms, with the major difference between them being the ability to perform batched edits. In this paper, we unify these two algorithms under a single conceptual umbrella, optimizing for the same goal, which we call the preservation-memorization objective. ROME uses an equality constraint to optimize this objective to perfor…

    Submitted 8 October, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: EMNLP 2024 Findings

  24. arXiv:2403.07175  [pdf, other]

    cs.CL cs.AI

    Rebuilding ROME : Resolving Model Collapse during Sequential Model Editing

    Authors: Akshat Gupta, Sidharth Baskaran, Gopala Anumanchipalli

    Abstract: Recent work using Rank-One Model Editing (ROME), a popular model editing method, has shown that there are certain facts that the algorithm is unable to edit without breaking the model. Such edits have previously been called disabling edits. These disabling edits cause immediate model collapse and limit the use of ROME for sequential editing. In this paper, we show that disabling edits are an arti…

    Submitted 8 October, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: EMNLP 2024 (Main)

  25. arXiv:2402.14805  [pdf, other]

    cs.CL cs.AI

    Identifying Multiple Personalities in Large Language Models with External Evaluation

    Authors: Xiaoyang Song, Yuta Adachi, Jessie Feng, Mouwei Lin, Linhao Yu, Frank Li, Akshat Gupta, Gopala Anumanchipalli, Simerjot Kaur

    Abstract: As Large Language Models (LLMs) are integrated with human daily applications rapidly, many societal and ethical concerns are raised regarding the behavior of LLMs. One of the ways to comprehend LLMs' behavior is to analyze their personalities. Many recent studies quantify LLMs' personalities using self-assessment tests that are created for humans. Yet many critiques question the applicability and…

    Submitted 22 February, 2024; originally announced February 2024.

  26. arXiv:2401.10015  [pdf, other]

    cs.CL eess.AS

    Towards Hierarchical Spoken Language Dysfluency Modeling

    Authors: Jiachen Lian, Gopala Anumanchipalli

    Abstract: Speech disfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no effective AI solution to systematically tackle this problem. We solidify the concept of disfluent speech and disfluent speech modeling. We then present the Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, the hierarchical extension of UDM that addresses both disfluency trans…

    Submitted 21 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 2024 EACL. Hierarchical extension of our previous workshop paper arXiv:2312.12810

  27. arXiv:2401.07453  [pdf, other]

    cs.CL cs.AI cs.IR

    Model Editing at Scale leads to Gradual and Catastrophic Forgetting

    Authors: Akshat Gupta, Anurag Rao, Gopala Anumanchipalli

    Abstract: Editing knowledge in large language models is an attractive capability to have which allows us to correct incorrectly learnt facts during pre-training, as well as update the model with an ever-growing list of new facts. While existing model editing techniques have shown promise, they are usually evaluated using metrics for reliability, specificity and generalization over one or few edits. We argue…

    Submitted 10 June, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

    Comments: ACL 2024 Findings

  28. arXiv:2312.12810  [pdf, other]

    eess.AS cs.SD

    Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection

    Authors: Jiachen Lian, Carly Feng, Naasir Farooqi, Steve Li, Anshul Kashyap, Cheol Jun Cho, Peter Wu, Robbie Netzorg, Tingle Li, Gopala Krishna Anumanchipalli

    Abstract: Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and dete…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: 2023 ASRU

  29. arXiv:2312.08494  [pdf, other]

    cs.SD cs.LG eess.AS

    PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models

    Authors: Robin Netzorg, Ajil Jalal, Luna McNulty, Gopala Krishna Anumanchipalli

    Abstract: Perceptual modification of voice is an elusive goal. While non-experts can modify an image or sentence perceptually with available tools, it is not clear how to similarly modify speech along perceptual axes. Voice conversion does make it possible to convert one voice to another, but these modifications are handled by black box models, and the specifics of what perceptual qualities to modify and ho…

    Submitted 13 December, 2023; originally announced December 2023.

  30. arXiv:2310.16287  [pdf, other]

    cs.SD cs.GR eess.AS

    Towards Streaming Speech-to-Avatar Synthesis

    Authors: Tejas S. Prabhune, Peter Wu, Bohan Yu, Gopala K. Anumanchipalli

    Abstract: Streaming speech-to-avatar synthesis creates real-time animations for a virtual character from audio data. Accurate avatar representations of speech are important for the visualization of sound in linguistics, phonetics, and phonology, visual feedback to assist second language acquisition, and virtual embodiment for paralyzed patients. Previous works have highlighted the capability of deep articul…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  31. arXiv:2310.10803  [pdf, other]

    cs.CL eess.AS

    SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT

    Authors: Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli

    Abstract: Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space and the units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt "self-distillation" obj…

    Submitted 10 April, 2025; v1 submitted 16 October, 2023; originally announced October 2023.

  32. arXiv:2310.10788  [pdf, other]

    eess.AS cs.CL

    Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

    Authors: Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli

    Abstract: Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained blackboxes, but many recent studies have begun "probing" models like HuBERT, to correlate their internal representations to different aspects of speech. In this paper, we show "inference of articulatory kinematics" as fundamental proper…

    Submitted 16 January, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

  33. arXiv:2310.02497  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities

    Authors: Robin Netzorg, Bohan Yu, Andrea Guzman, Peter Wu, Luna McNulty, Gopala Anumanchipalli

    Abstract: Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can understand how to describe an image or sentence via perception, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptu…

    Submitted 3 October, 2023; originally announced October 2023.

  34. arXiv:2309.09088  [pdf, other]

    cs.SD eess.AS

    Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

    Authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland

    Abstract: Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual q…

    Submitted 18 December, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

  35. arXiv:2309.08163  [pdf, other]

    cs.CL cs.AI

    Self-Assessment Tests are Unreliable Measures of LLM Personality

    Authors: Akshat Gupta, Xiaoyang Song, Gopala Anumanchipalli

    Abstract: As large language models (LLMs) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of "personality" of LLMs using self-assessment personality tests developed to measure human personality. Yet almost none of these works verify the applicability of these tests on LLMs…

    Submitted 2 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

  36. arXiv:2309.07861  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    CiwaGAN: Articulatory information exchange

    Authors: Gašper Beguš, Thomas Lu, Alan Zhou, Peter Wu, Gopala K. Anumanchipalli

    Abstract: Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeli…

    Submitted 14 September, 2023; originally announced September 2023.

  37. arXiv:2308.06443  [pdf, other]

    cs.LG eess.AS

    Neural Latent Aligner: Cross-trial Alignment for Learning Representations of Complex, Naturalistic Neural Data

    Authors: Cheol Jun Cho, Edward F. Chang, Gopala K. Anumanchipalli

    Abstract: Understanding the neural implementation of complex human behaviors is one of the major goals in neuroscience. To this end, it is crucial to find a true representation of the neural data, which is challenging due to the high complexity of behaviors and the low signal-to-noise ratio (SNR) of the signals. Here, we propose a novel unsupervised learning framework, Neural Latent Aligner (NLA), to find well-co…

    Submitted 11 August, 2023; originally announced August 2023.

    Comments: Accepted at ICML 2023

    Journal ref: Proceedings of the 40th International Conference on Machine Learning (2023), PMLR 202:5661-5676

  38. arXiv:2302.06774  [pdf, other]

    eess.AS cs.SD

    Speaker-Independent Acoustic-to-Articulatory Speech Inversion

    Authors: Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

    Abstract: To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages…

    Submitted 24 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

  39. arXiv:2210.16498  [pdf, other]

    eess.AS cs.SD

    Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization

    Authors: Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli

    Abstract: Articulatory representation learning is fundamental research in modeling the neural speech production system. Our previous work has established a deep paradigm to decompose the articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded with human speech production mechanism, and corresponding gestural scores. We continue with this line of w…

    Submitted 20 February, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: Accepted to 2023 ICASSP. Camera Ready

  40. arXiv:2210.15272  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution

    Authors: Yisi Liu, Peter Wu, Alan W Black, Gopala K. Anumanchipalli

    Abstract: Estimation of fundamental frequency (F0) in voiced segments of speech signals, also known as pitch tracking, plays a crucial role in pitch synchronous speech analysis, speech synthesis, and speech manipulation. In this paper, we capitalize on the high time and frequency resolution of the pseudo Wigner-Ville distribution (PWVD) and propose a new PWVD-based pitch estimation method. We devise an effi…

    Submitted 27 October, 2022; originally announced October 2022.

  41. arXiv:2210.15173  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Articulation GAN: Unsupervised modeling of articulatory learning

    Authors: Gašper Beguš, Alan Zhou, Peter Wu, Gopala K Anumanchipalli

    Abstract: Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, which results in the production of speech sounds through physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new un…

    Submitted 12 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

  42. Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech

    Authors: Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala K. Anumanchipalli

    Abstract: Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand such utilities, various analyses have been done for speech SSL models to reveal which and how information is encoded in the learned representations. Although the scope of previous analyses is extensive in acoustic, phonetic, and…

    Submitted 20 July, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

  43. arXiv:2209.06337  [pdf, other]

    eess.AS cs.SD q-bio.QM

    Deep Speech Synthesis from Articulatory Representations

    Authors: Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala K. Anumanchipalli

    Abstract: In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Current works have highlighted the potential for deep learning models to perform articulatory synthesis. How…

    Submitted 13 September, 2022; originally announced September 2022.

  44. arXiv:2206.02512  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

    Authors: Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

    Abstract: In this paper, we propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning, it is developed from a perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (id…

    Submitted 6 October, 2024; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 31)

  45. arXiv:2205.05227  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

    Authors: Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

    Abstract: Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We have demonstrated that simultaneous disentangling content embedding and speaker embedding from one utterance is feasible fo…

    Submitted 20 June, 2022; v1 submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted to 2022 Interspeech. Demo link is here https://jlian2.github.io/Improved-Voice-Conversion-with-Conditional-DSVAE/

  46. arXiv:2204.00465  [pdf, other]

    eess.AS cs.AI eess.SP

    Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition

    Authors: Jiachen Lian, Alan W Black, Louis Goldstein, Gopala Krishna Anumanchipalli

    Abstract: Most of the research on data-driven speech representation learning has focused on raw audios in an end-to-end manner, paying little attention to their internal phonological or gestural structure. This work, investigating the speech representations derived from articulatory kinematics signals, uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data…

    Submitted 20 June, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: Accepted to 2022 Interspeech. Code is publicly available at https://github.com/Berkeley-Speech-Group/ema_gesture

  47. arXiv:1909.01401  [pdf, other]

    cs.LG cs.CL q-bio.NC stat.ML

    Brain2Char: A Deep Architecture for Decoding Text from Brain Recordings

    Authors: Pengfei Sun, Gopala K. Anumanchipalli, Edward F. Chang

    Abstract: Decoding language representations directly from the brain can enable new Brain-Computer Interfaces (BCI) for high bandwidth human-human and human-machine communication. Clinically, such technologies can restore communication in people with neurological conditions affecting their ability to speak. In this study, we propose a novel deep network architecture Brain2Char, for directly decoding text (sp…

    Submitted 3 September, 2019; originally announced September 2019.