
Showing 1–29 of 29 results for author: Mouchtaris, A

  1. arXiv:2506.12103  [pdf, other]

    cs.AI cs.CY cs.LG

    The Amazon Nova Family of Models: Technical Report and Model Card

    Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, et al. (761 additional authors not shown)

    Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…

    Submitted 17 March, 2025; originally announced June 2025.

    Comments: 48 pages, 10 figures

    Report number: 20250317

  2. arXiv:2505.14871  [pdf, ps, other]

    cs.CL cs.LG

    Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

    Authors: Ryan Solgi, Kai Zhen, Rupak Vignesh Swaminathan, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

    Abstract: The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their application to compressing pre-trained large language models (LLMs) for downstream tasks (post-training) remains challenging d…

    Submitted 20 May, 2025; originally announced May 2025.

  3. arXiv:2503.04992  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Wanda++: Pruning Large Language Models via Regional Gradients

    Authors: Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar

    Abstract: Large Language Models (LLMs) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients.…

    Submitted 1 June, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: Paper accepted at ACL 2025 Findings

  4. arXiv:2502.12346  [pdf, other]

    cs.LG cs.AI

    QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

    Authors: Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

    Abstract: Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various down-stream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam optimization require backpropagation, which are error-prone in the low-precision settings. To…

    Submitted 17 February, 2025; originally announced February 2025.

  5. arXiv:2502.11513  [pdf, other]

    cs.LG cs.AI

    MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models

    Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

    Abstract: Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has large…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: 17 pages

  6. arXiv:2406.18060  [pdf, other]

    cs.CL cs.AI cs.LG

    AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

    Authors: Yifan Yang, Kai Zhen, Ershad Banijamali, Athanasios Mouchtaris, Zheng Zhang

    Abstract: Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, s…

    Submitted 2 December, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted for publication in EMNLP 2024

  7. Accelerator-Aware Training for Transducer-Based Speech Recognition

    Authors: Suhaila M. Shakiah, Rupak Vignesh Swaminathan, Hieu Duy Nguyen, Raviteja Chinta, Tariq Afzal, Nathan Susanj, Athanasios Mouchtaris, Grant P. Strimel, Ariya Rastrow

    Abstract: Machine learning model weights and activations are represented in full-precision during training. This leads to performance degradation in runtime when deployed on neural network accelerator (NNA) chips, which leverage highly parallelized fixed-point arithmetic to improve runtime memory and latency. In this work, we replicate the NNA operators during the training phase, accounting for the degradat…

    Submitted 12 May, 2023; originally announced May 2023.

    Comments: Accepted to SLT 2022

    Journal ref: IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 100-107

  8. arXiv:2305.05271  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

    Authors: Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu, Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris

    Abstract: Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare-words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approaches employ cross-attention to bias the model towards specific contextual entities injected as bias-phrases to the model. Prior approaches typically relied on subw…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  9. arXiv:2304.01905  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition

    Authors: Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree M. Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: We present dual-attention neural biasing, an architecture designed to boost Wake Words (WW) recognition and improve inference time latency on speech recognition tasks. This architecture enables a dynamic switch for its runtime compute paths by exploiting WW spotting to select which branch of its attention networks to execute for an input audio frame. With this approach, we effectively improve WW s…

    Submitted 4 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: Accepted to Proc. IEEE ICASSP 2023

  10. arXiv:2210.09188  [pdf, other]

    cs.SD cs.LG eess.AS

    Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

    Authors: Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

    Abstract: For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with…

    Submitted 1 November, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: Accepted for publication at IEEE SLT'22

  11. arXiv:2209.14868  [pdf, other]

    cs.SD cs.CL eess.AS

    ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

    Authors: Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

    Abstract: The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and betwee…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: This paper was presented in Interspeech 2022

  12. arXiv:2207.02393  [pdf, other]

    cs.CL cs.SD eess.AS

    Compute Cost Amortized Transformer for Streaming ASR

    Authors: Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel

    Abstract: We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding, enabling significant reductions in compute with minimal impact on acc…

    Submitted 4 July, 2022; originally announced July 2022.

  13. arXiv:2206.15408  [pdf, other]

    eess.AS cs.AI eess.SP

    Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition

    Authors: Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, Ariya Rastrow

    Abstract: We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators. Our method is inspired by Lloyd-Max compression theory with practical adaptations for a feasible computational overhead during training. With the quantization centroids derived from a 32-bit baseline, we augment training loss with a Multi-Regional Absolute Cosine (MRACos) regularizer t…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted for publication in INTERSPEECH 2022

  14. arXiv:2205.13660  [pdf, other]

    cs.CL cs.LG

    Contextual Adapters for Personalized Speech Recognition in Neural Transducers

    Authors: Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose train…

    Submitted 26 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022

  15. arXiv:2205.05590  [pdf, other]

    cs.CL cs.SD eess.AS

    A neural prosody encoder for end-to-end dialogue act classification

    Authors: Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Müller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo

    Abstract: Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems. Prosodic features such as energy and pitch have been shown to be useful for DAC. Despite their importance, little research has explored neural approaches to integrate prosodic features into end-to-end (E2E) DAC models which infer dialogue acts directly from audio signals. In this work, we pr…

    Submitted 11 May, 2022; originally announced May 2022.

  16. arXiv:2111.03250  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Context-Aware Transformer Transducer for Speech Recognition

    Authors: Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words that appear infrequently in the training data. One promising method to improve the recognition accuracy on such rare words is to latch onto personalized/contextual information at inference. In this work, we present a novel context-aware transformer transducer (CATT) network that improves…

    Submitted 5 November, 2021; originally announced November 2021.

    Comments: Accepted to ASRU 2021

  17. arXiv:2111.00400  [pdf, other]

    cs.CL cs.SD eess.AS

    FANS: Fusing ASR and NLU for on-device SLU

    Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

    Abstract: Spoken language understanding (SLU) systems translate voice input commands to semantics which are encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models where the first one maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-e…

    Submitted 30 October, 2021; originally announced November 2021.

    Comments: Published in Interspeech 2021

  18. arXiv:2108.12953  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Channel Transformer Transducer for Speech Recognition

    Authors: Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo

    Abstract: Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach is characterized by a high computational complexity, which prevents it from being deployed in on-device systems.…

    Submitted 29 August, 2021; originally announced August 2021.

    Journal ref: Published in INTERSPEECH 2021

  19. arXiv:2106.09009  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    End-to-End Spoken Language Understanding for Generalized Voice Assistants

    Authors: Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris

    Abstract: End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in com…

    Submitted 19 July, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021; 5 pages, 2 tables, 1 figure

    Journal ref: Proc. Interspeech 2021, 4738-4742

  20. arXiv:2106.07734  [pdf, other]

    cs.CL cs.LG eess.AS

    CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition

    Authors: Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris

    Abstract: We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at InterSpeech 2021

  21. arXiv:2106.06126  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models

    Authors: Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi, Chunchuan Lyu, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% w…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: TSD2021

  22. arXiv:2102.04932  [pdf, other]

    cs.LG cs.AI cs.CL cs.SD eess.AS

    Sparsification via Compressed Sensing for Automatic Speech Recognition

    Authors: Kai Zhen, Hieu Duy Nguyen, Feng-Ju Chang, Athanasios Mouchtaris, Ariya Rastrow

    Abstract: In order to achieve high accuracy for machine learning (ML) applications, it is essential to employ models with a large number of parameters. Certain applications, such as Automatic Speech Recognition (ASR), however, require real-time interactions with users, hence compelling the model to have as low latency as possible. Deploying large scale ML applications thus necessitates model quantization an…

    Submitted 9 February, 2021; originally announced February 2021.

    Comments: 5 pages, accepted for publication in (ICASSP 2021) 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing. June 6-12, 2021. Location: Toronto, ON, Canada

  23. arXiv:2102.03951  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Multi-Channel Transformer for Speech Recognition

    Authors: Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann

    Abstract: Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consist…

    Submitted 7 February, 2021; originally announced February 2021.

    Comments: Accepted by 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

  24. arXiv:2011.09044  [pdf, other]

    eess.AS cs.CL cs.SD

    Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

    Authors: Bhuvan Agrawal, Markus Müller, Martin Radfar, Samridhi Choudhary, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent spa…

    Submitted 15 April, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

    Comments: 7 pages, 6 figures

  25. arXiv:2008.10984  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Neural Transformer Based Spoken Language Understanding

    Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals. While the neural transformers consistently deliver the best performance among the state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in a closely related field, i.e., spoken language understanding (SLU), have not been investigated. In thi…

    Submitted 12 August, 2020; originally announced August 2020.

    Comments: Interspeech 2020

  26. arXiv:2008.02858  [pdf, other]

    cs.CL cs.SD eess.AS

    Semantic Complexity in End-to-End Spoken Language Understanding

    Authors: Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris

    Abstract: End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands; however, no study has yet addressed how these…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: Accepted at Interspeech, 2020

  27. arXiv:2007.03900  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

    Authors: Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream pr…

    Submitted 8 July, 2020; originally announced July 2020.

  28. arXiv:1809.10117  [pdf, ps, other]

    eess.IV cs.CV

    Convolutional Neural Networks for Video Quality Assessment

    Authors: Michalis Giannopoulos, Grigorios Tsagkatakis, Saverio Blasi, Farzad Toutounchi, Athanasios Mouchtaris, Panagiotis Tsakalides, Marta Mrak, Ebroul Izquierdo

    Abstract: Video Quality Assessment (VQA) is a very challenging task due to its highly subjective nature. Moreover, many factors influence VQA. Compression of video content, while necessary for minimising transmission and storage requirements, introduces distortions which can have detrimental effects on the perceived quality. Especially when dealing with modern video coding standards, it is extremely difficu…

    Submitted 26 September, 2018; originally announced September 2018.

    Comments: Number of Pages: 12, Number of Figures: 17, Submitted to: Signal Processing: Image Communication (Elsevier)

  29. arXiv:1204.1085  [pdf, ps, other]

    cs.IT

    Post-Nonlinear Sparse Component Analysis Using Single-Source Zones and Functional Data Clustering

    Authors: Matthieu Puigt, Anthony Griffin, Athanasios Mouchtaris

    Abstract: In this paper, we introduce a general extension of linear sparse component analysis (SCA) approaches to postnonlinear (PNL) mixtures. In particular, and contrary to the state-of-art methods, our approaches use a weak sparsity source assumption: we look for tiny temporal zones where only one source is active. We investigate two nonlinear single-source confidence measures, using the mutual informati…

    Submitted 4 April, 2012; originally announced April 2012.

    Comments: 11 pages, submitted to IEEE Transactions on Signal Processing