-
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
Authors:
Gerard I. Gállego,
Oriol Pareras,
Martí Cortada Garcia,
Lucas Takanori,
Javier Hernando
Abstract:
We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
Submitted 30 May, 2025;
originally announced May 2025.
-
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Authors:
Iñigo Pikabea,
Iñaki Lacunza,
Oriol Pareras,
Carlos Escolano,
Aitor Gonzalez-Agirre,
Javier Hernando,
Marta Villegas
Abstract:
Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
Submitted 20 May, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Integrating Sensing and Communications in 6G? Not Until It Is Secure to Do So
Authors:
Nanchi Su,
Fan Liu,
Jiaqi Zou,
Christos Masouros,
George C. Alexandropoulos,
Alain Mourad,
Javier Lorca Hernando,
Qinyu Zhang,
Tse-Tin Chan
Abstract:
Integrated Sensing and Communication (ISAC) is emerging as a cornerstone technology for forthcoming 6G systems, significantly improving spectrum and energy efficiency. However, the commercial viability of ISAC hinges on addressing critical challenges surrounding security, privacy, and trustworthiness. These challenges necessitate an end-to-end framework to safeguard both communication data and sensing information, particularly in ultra-low-latency and highly connected environments. Conventional solutions, such as encryption and key management, often fall short when confronted with ISAC's dual-functional nature. In this context, the physical layer plays a pivotal role: this article reviews emerging physical-layer strategies, including artificial noise (AN) injection, cooperative jamming, and constructive interference (CI), which enhance security by mitigating eavesdropping risks and safeguarding both communication data and sensing information. We further highlight the unique privacy issues that ISAC introduces to cellular networks and outline future research directions aimed at ensuring robust security and privacy for efficient ISAC deployment in 6G.
Submitted 19 March, 2025;
originally announced March 2025.
-
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Authors:
Daniel Tamayo,
Aitor Gonzalez-Agirre,
Javier Hernando,
Marta Villegas
Abstract:
Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT.
Submitted 4 February, 2025;
originally announced February 2025.
-
Work-Efficient Parallel Non-Maximum Suppression Kernels
Authors:
David Oro,
Carles Fernández,
Xavier Martorell,
Javier Hernando
Abstract:
In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data.
Submitted 1 February, 2025;
originally announced February 2025.
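The classic greedy NMS method that the abstract above builds on can be sketched sequentially as follows. This is not the authors' parallel GPU kernel, only the baseline procedure it accelerates; the function names, the (x1, y1, x2, y2) box layout, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than iou_threshold (assumed value, not from the paper).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one isolated detection.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
example_scores = [0.9, 0.8, 0.7]
kept = greedy_nms(example_boxes, example_scores)
```

The loop over `keep` makes this quadratic in the number of detections in the worst case, which is the sequential dependency a parallel formulation has to work around.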
-
Language Modelling for Speaker Diarization in Telephonic Interviews
Authors:
Miquel India,
Javier Hernando,
José A. R. Fonollosa
Abstract:
The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic ones. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database, which is composed of telephone interview audio. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER as compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
Submitted 28 January, 2025;
originally announced January 2025.
-
On the Use of Audio to Improve Dialogue Policies
Authors:
Daniel Roncel,
Federico Costa,
Javier Hernando
Abstract:
With the significant progress of speech technologies, spoken goal-oriented dialogue systems are becoming increasingly popular. One of the main modules of a dialogue system is typically the dialogue policy, which is responsible for determining system actions. This component usually relies only on audio transcriptions, being strongly dependent on their quality and ignoring very important extralinguistic information embedded in the user's speech. In this paper, we propose new architectures to add audio information by combining speech and text embeddings using a Double Multi-Head Attention component. Our experiments show that audio embedding-aware dialogue policies outperform text-based ones, particularly in noisy transcription scenarios, and that how text and audio embeddings are combined is crucial to improve performance. We obtained a 9.8% relative improvement in the User Request Score compared to an only-text-based dialogue system on the DSTC2 dataset.
Submitted 17 October, 2024;
originally announced October 2024.
-
Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge
Authors:
Federico Costa,
Miquel India,
Javier Hernando
Abstract:
As computer-based applications are becoming more integrated into our daily lives, the importance of Speech Emotion Recognition (SER) has increased significantly. Promoting research with innovative approaches in SER, the Odyssey 2024 Speech Emotion Recognition Challenge was organized as part of the Odyssey 2024 Speaker and Language Recognition Workshop. In this paper we describe the Double Multi-Head Attention Multimodal System developed for this challenge. Pre-trained self-supervised models were used to extract informative acoustic and text features. An early fusion strategy was adopted, where a Multi-Head Attention layer transforms these mixed features into complementary contextualized representations. A second attention mechanism is then applied to pool these representations into an utterance-level vector. Our proposed system achieved the third position in the categorical task ranking with a 34.41% Macro-F1 score, where 31 teams participated in total.
Submitted 15 June, 2024;
originally announced June 2024.
-
Speaker Characterization by means of Attention Pooling
Authors:
Federico Costa,
Miquel India,
Javier Hernando
Abstract:
State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
Submitted 7 May, 2024;
originally announced May 2024.
-
Technology Trends for Massive MIMO towards 6G
Authors:
Yiming Huo,
Xingqin Lin,
Boya Di,
Hongliang Zhang,
Francisco Javier Lorca Hernando,
Ahmet Serdar Tan,
Shahid Mumtaz,
Özlem Tuğfe Demir,
Kun Chen-Hu
Abstract:
At the dawn of the next-generation wireless systems and networks, massive multiple-input multiple-output (MIMO) has been envisioned as one of the enabling technologies. With its continued success in 5G and beyond, massive MIMO technology has demonstrated its advantages, integrability, and extensibility. Moreover, several evolutionary features and revolutionizing trends for massive MIMO have gradually emerged in recent years, which are expected to reshape the future 6G wireless systems and networks. Specifically, the functions and performance of future massive MIMO systems will be enabled and enhanced by combining them with other innovative technologies, architectures, and strategies such as intelligent omni-surfaces (IOSs)/intelligent reflecting surfaces (IRSs), artificial intelligence (AI), THz communications, and cell-free architectures. Also, more diverse vertical applications based on massive MIMO will emerge and prosper, such as wireless localization and sensing, vehicular communications, non-terrestrial communications, remote sensing, and inter-planetary communications.
Submitted 5 January, 2023; v1 submitted 4 January, 2023;
originally announced January 2023.
-
The UPC Speaker Verification System Submitted to VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20)
Authors:
Umair Khan,
Javier Hernando
Abstract:
This report describes the submission from the Technical University of Catalonia (UPC) to the VoxCeleb Speaker Recognition Challenge (VoxSRC-20) at Interspeech 2020. The final submission is a combination of three systems. System-1 is an autoencoder-based approach which tries to reconstruct similar i-vectors, whereas System-2 and System-3 are Convolutional Neural Network (CNN) based siamese architectures. The siamese networks have two and three branches, respectively, where each branch is a CNN encoder. The double-branch siamese performs binary classification using cross entropy loss during training, whereas our triple-branch siamese is trained to learn speaker embeddings using triplet loss. We provide results of our systems on the VoxCeleb-1 test set and the VoxSRC-20 validation and test sets.
Submitted 27 October, 2020; v1 submitted 21 October, 2020;
originally announced October 2020.
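As a rough illustration of the triplet loss mentioned in the abstract above, the following sketch computes the standard hinge-style formulation on three embedding vectors. The use of Euclidean distance, the margin value of 0.2, and all names are assumptions made for this example; the report's abstract does not specify them.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are embeddings of the same speaker, negative is
    an embedding of a different speaker. Distance metric and margin are
    illustrative assumptions.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# The negative is already much farther from the anchor than the positive,
# so this triplet contributes no loss.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Swapping positive and negative violates the margin and is penalised.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimising this quantity pushes same-speaker embeddings closer together than different-speaker embeddings by at least the margin.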
-
Self-attention encoding and pooling for speaker recognition
Authors:
Pooyan Safari,
Miquel India,
Javier Hernando
Abstract:
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture is able to outperform the baseline x-vector, and performs competitively with some other benchmarks based on convolutions, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters compared to ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances.
Submitted 3 August, 2020;
originally announced August 2020.
-
Double Multi-Head Attention for Speaker Verification
Authors:
Miquel India,
Pooyan Safari,
Javier Hernando
Abstract:
Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self-attention layer is added to the pooling layer that summarizes the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by giving weights to the information captured by each head, and it results in more discriminative speaker embeddings. We have evaluated our approach with the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvement in terms of EER compared to Self Attention pooling and Self Multi-Head Attention, respectively. According to the obtained results, Double Multi-Head Attention has been shown to be an excellent approach to efficiently select the most relevant features captured by the CNN-based front-ends from the speech signal.
Submitted 9 January, 2021; v1 submitted 26 July, 2020;
originally announced July 2020.
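The two-stage pooling described in the abstract above can be sketched in plain Python as follows. This is a minimal reading of the abstract, not the authors' implementation: the dot-product scoring, the way frames are split into heads, and every name here (`attention_pool`, `double_mha_pool`, `head_queries`, `final_query`) are assumptions. In a real system the query vectors would be trained parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of vectors, weights = softmax of dot products with query."""
    scores = [sum(x * q for x, q in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def double_mha_pool(frames, head_queries, final_query):
    """Pool a variable-length frame sequence into one fixed-length vector.

    Stage 1: each head attends over time on its own slice of the features,
             producing one context vector per head.
    Stage 2: a second attention weights the head context vectors and sums
             them into a single utterance-level representation.
    """
    num_heads = len(head_queries)
    head_dim = len(frames[0]) // num_heads
    contexts = []
    for k, query in enumerate(head_queries):
        head_frames = [f[k * head_dim:(k + 1) * head_dim] for f in frames]
        contexts.append(attention_pool(head_frames, query))
    return attention_pool(contexts, final_query)


# Two frames of dimension 4, two heads of dimension 2, all-zero queries
# (so every attention weight is uniform in this toy case).
toy_frames = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
toy_output = double_mha_pool(toy_frames, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

Under these assumptions the output dimension equals the per-head dimension regardless of how many frames the utterance contains, which is what lets the layer map variable-length input to a fixed-length vector.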
-
End-to-end User Recognition using Touchscreen Biometrics
Authors:
Michał Krzemiński,
Javier Hernando
Abstract:
We study touchscreen data as behavioural biometrics. The goal was to create an end-to-end system that can transparently identify users using raw data from mobile devices. Touchscreen biometrics has been studied only a few times, in a series of works that differ in methodology and databases. In the proposed system, data from the touchscreen goes directly, without any processing, to the input of a deep neural network, which decides on the identity of the user. No hand-crafted features are used. The implemented classification algorithm tries to find patterns on its own from raw data. The achieved results show that the proposed deep model is sufficient for the given identification task. The performed tests indicate high accuracy of user identification and better EER results compared to state-of-the-art systems. The best result achieved by our system is 0.65% EER.
Submitted 9 June, 2020;
originally announced June 2020.
-
Self Multi-Head Attention for Speaker Recognition
Authors:
Miquel India,
Pooyan Safari,
Javier Hernando
Abstract:
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which decides the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task with the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.
Submitted 1 July, 2019; v1 submitted 24 June, 2019;
originally announced June 2019.
-
From Big Data to Big Displays: High-Performance Visualization at Blue Brain
Authors:
Stefan Eilemann,
Marwan Abdellah,
Nicolas Antille,
Ahmet Bilgili,
Grigory Chevtchenko,
Raphael Dumusc,
Cyrille Favreau,
Juan Hernando,
Daniel Nachbaur,
Pawel Podhajski,
Jafet Villafranca,
Felix Schürmann
Abstract:
Blue Brain has pushed high-performance visualization (HPV) to complement its HPC strategy since its inception in 2007. In 2011, this strategy was accelerated to develop innovative visualization solutions through increased funding and strategic partnerships with other research institutions.
We present the key elements of this HPV ecosystem, which integrates C++ visualization applications with novel collaborative display systems. We motivate how our strategy of transforming visualization engines into services enables a variety of use cases, not only for the integration with high-fidelity displays, but also to build service oriented architectures, to link into web applications and to provide remote services to Python applications.
Submitted 30 June, 2017;
originally announced June 2017.
-
Deep Learning for Single and Multi-Session i-Vector Speaker Recognition
Authors:
Omid Ghahabi,
Javier Hernando
Abstract:
The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBN) and Deep Neural Networks (DNN) to discriminatively model each target speaker. In order to gain more insight into the behavior of DL techniques in both single and multi-session speaker enrollment tasks, experiments have been carried out in this paper in both scenarios. Additionally, the parameters of the global model, referred to as universal DBN (UDBN), are normalized before adaptation. UDBN normalization facilitates training DNNs, specifically those with more than one hidden layer. Experiments are performed on the NIST SRE 2006 corpus. It is shown that the proposed impostor selection algorithm and UDBN adaptation process enhance the performance of conventional DNNs by 8-20% and 16-20% in terms of EER for the single and multi-session tasks, respectively. In both scenarios, the proposed architectures outperform the baseline systems, obtaining up to a 17% reduction in EER.
Submitted 8 December, 2015;
originally announced December 2015.