Showing 1–6 of 6 results for author: Kesiraju, S

Searching in archive eess.
  1. arXiv:2506.08633

    eess.AS cs.CL

    Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs

    Authors: Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju, Oldřich Plchot, Jan Černocký

    Abstract: In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fu…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  2. arXiv:2506.04714

    cs.CL eess.AS

    IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

    Authors: Bhavana Akkiraju, Aishwarya Pothula, Santosh Kesiraju, Anil Kumar Vuppala

    Abstract: This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate s…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Accepted to IWSLT 2025

  3. arXiv:2410.17437

    eess.AS

    Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

    Authors: Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký

    Abstract: This paper proposes a simple yet effective way of regularising the encoder-decoder-based automatic speech recognition (ASR) models that enhance the robustness of the model and improve the generalisation to out-of-domain scenarios. The proposed approach is dubbed as $\textbf{De}$coder-$\textbf{C}$entric $\textbf{R}$egularisation in $\textbf{E}$ncoder-$\textbf{D}$ecoder (DeCRED) architecture for ASR…

    Submitted 22 October, 2024; originally announced October 2024.

  4. arXiv:2403.07767

    eess.AS cs.LG eess.SP

    Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets

    Authors: Jan Pešán, Santosh Kesiraju, Lukáš Burget, Jan "Honza" Černocký

    Abstract: Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets is seldom scrutinized for text-dependency. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuine…

    Submitted 18 October, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  5. Strategies for improving low resource speech to text translation relying on pre-trained ASR models

    Authors: Santosh Kesiraju, Marek Sarvas, Tomas Pavlicek, Cecile Macaire, Alejandro Ciuba

    Abstract: This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST). We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively. Using the encoder-decoder framework for ST, our results show that a multilingual automatic speech recognition system acts as a g…

    Submitted 31 May, 2023; originally announced June 2023.

  6. arXiv:2104.02332

    eess.AS

    Detecting English Speech in the Air Traffic Control Voice Communication

    Authors: Igor Szoke, Santosh Kesiraju, Ondrej Novotny, Martin Kocour, Karel Vesely, Jan "Honza" Cernocky

    Abstract: We launched a community platform for collecting the ATC speech world-wide in the ATCO2 project. Filtering out unseen non-English speech is one of the main components in the data processing pipeline. The proposed English Language Detection (ELD) system is based on the embeddings from Bayesian subspace multinomial model. It is trained on the word confusion network from an ASR system. It is robust, e…

    Submitted 6 April, 2021; originally announced April 2021.