-
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
Authors:
Asbjørn Munk,
Stefano Cerri,
Jakob Ambsdorf,
Julia Machnio,
Sebastian Nørgaard Llambias,
Vardan Nersesjan,
Christian Hedeager Krag,
Peirong Liu,
Pablo Rocamora García,
Mostafa Mehdipour Ghazi,
Mikael Boesen,
Michael Eriksen Benros,
Juan Eugenio Iglesias,
Mads Nielsen
Abstract:
We present FOMO60K, a large-scale, heterogeneous dataset of 60,529 brain Magnetic Resonance Imaging (MRI) scans from 13,900 sessions and 11,187 subjects, aggregated from 16 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing barriers to entry for new users. Accompanying code for self-supervised pretraining and finetuning is provided. FOMO60K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
Submitted 17 June, 2025;
originally announced June 2025.
-
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Authors:
Amir Hussein,
Cihan Xiao,
Matthew Wiesner,
Dan Povey,
Leibny Paola Garcia,
Sanjeev Khudanpur
Abstract:
Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing approaches struggle with word reordering and performance degradation when jointly modeling ASR and ST, resulting in a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best practices from ASR transducers, including a down-sampled hierarchical encoder, a stateless predictor, and a pruned transducer loss to reduce training complexity. Finally, we introduce a blank penalty during decoding, reducing deletions and improving translation quality. Our approach is evaluated on three conversational datasets (Arabic, Spanish, and Mandarin), achieving new state-of-the-art performance among NT models and substantially narrowing the gap with AED-based systems.
Submitted 2 June, 2025;
originally announced June 2025.
-
Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
Authors:
Haoyang Zhang,
Hexin Liu,
Xiangyu Zhang,
Qiquan Zhang,
Yuchen Hu,
Junqi Zhao,
Fei Tian,
Xuerui Yang,
Leibny Paola Garcia,
Eng Siong Chng
Abstract:
The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.
Submitted 13 June, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
Safe Autonomous Environmental Contact for Soft Robots using Control Barrier Functions
Authors:
Akua K. Dickson,
Juan C. Pacheco Garcia,
Meredith L. Anderson,
Ran Jing,
Sarah Alizadeh-Shabdiz,
Audrey X. Wang,
Charles DeLorey,
Zach J. Patterson,
Andrew P. Sabelhaus
Abstract:
Robots built from soft materials will inherently apply lower environmental forces than their rigid counterparts, and therefore may be more suitable in sensitive settings with unintended contact. However, these robots' applied forces result from both their design and their control system in closed-loop, and therefore, ensuring bounds on these forces requires controller synthesis for safety as well. This article introduces the first feedback controller for a soft manipulator that formally meets a safety specification with respect to environmental contact. In our proof-of-concept setting, the robot's environment has known geometry and is deformable with a known elastic modulus. Our approach maps a bound on applied forces to a safe set of positions of the robot's tip via predicted deformations of the environment. Then, a quadratic program with Control Barrier Functions in its constraints is used to supervise a nominal feedback signal, verifiably maintaining the robot's tip within this safe set. Hardware experiments on a multi-segment soft pneumatic robot demonstrate that the proposed framework successfully constrains its environmental contact forces. This framework represents a fundamental shift in perspective on control and safety for soft robots, defining and implementing a formally verifiable logic specification on their pose and contact forces.
Submitted 20 April, 2025;
originally announced April 2025.
-
QKD-KEM: Hybrid QKD Integration into TLS with OpenSSL Providers
Authors:
Javier Blanco-Romero,
Pedro Otero García,
Daniel Sobral-Blanco,
Florina Almenares Mendoza,
Ana Fernández Vilas,
Rebeca P. Díaz-Redondo
Abstract:
Quantum Key Distribution (QKD) promises information-theoretic security, yet integrating QKD into existing protocols like TLS remains challenging due to its fundamentally different operational model. In this paper, we propose a hybrid QKD-KEM protocol with two distinct integration approaches: a client-initiated flow compatible with both ETSI 004 and 014 specifications, and a server-initiated flow similar to existing work but limited to stateless ETSI 014 APIs. Unlike previous implementations, our work specifically addresses the integration of stateful QKD key exchange protocols (ETSI 004) which is essential for production QKD networks but has remained largely unexplored. By adapting OpenSSL's provider infrastructure to accommodate QKD's pre-distributed key model, we maintain compatibility with current TLS implementations while offering dual layers of security. Performance evaluations demonstrate the feasibility of our hybrid scheme with acceptable overhead, showing that robust security against quantum threats is achievable while addressing the unique requirements of different QKD API specifications.
Submitted 10 March, 2025;
originally announced March 2025.
-
Understanding intra-node communication in HPC systems and Datacenters
Authors:
Joaquin Tarraga-Moreno,
Jesus Escudero-Sahuquillo,
Pedro Javier Garcia,
Francisco J. Quiles
Abstract:
Over the past decade, specialized computing and storage devices, such as GPUs, TPUs, and high-speed storage, have been increasingly integrated into server nodes within Supercomputers and Data Centers. The advent of high-bandwidth memory (HBM) has facilitated a more compact design for these components, enabling multiple units to be interconnected within a single server node through intra-node networks like PCIe, NVLink, or Ethernet. These networks allow for scaling up the number of dedicated computing and storage devices per node. Additionally, inter-node networks link these devices across thousands of server nodes in large-scale computing systems. However, as communication demands among accelerators grow, especially in workloads like generative AI, both intra- and inter-node networks risk becoming critical bottlenecks. Although modern intra-node network architectures attempt to mitigate this issue by boosting bandwidth, we demonstrate in this paper that such an approach can inadvertently degrade inter-node communication. This occurs when high-bandwidth intra-node traffic interferes with incoming traffic from external nodes, leading to congestion. To evaluate this phenomenon, we analyze the communication behavior of realistic traffic patterns commonly found in generative AI applications. Using OMNeT++, we developed a general simulation model that captures both intra- and inter-node network interactions. Through extensive simulations, our findings reveal that increasing intra-node bandwidth and the number of accelerators per node can actually hinder overall inter-node communication performance rather than improve it.
Submitted 28 February, 2025;
originally announced February 2025.
-
KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level
Authors:
Ruining Deng,
Tianyuan Yao,
Yucheng Tang,
Junlin Guo,
Siqi Lu,
Juming Xiong,
Lining Yu,
Quan Huu Cap,
Pengzhou Cai,
Libin Lan,
Ze Zhao,
Adrian Galdran,
Amit Kumar,
Gunjan Deotale,
Dev Kumar Das,
Inyoung Paik,
Joonho Lee,
Geongyu Lee,
Yujia Chen,
Wangkai Li,
Zhaoyang Li,
Xuege Hou,
Zeyuan Wu,
Shengjin Wang,
Maximilian Fischer
, et al. (22 additional authors not shown)
Abstract:
Chronic kidney disease (CKD) is a major global health issue, affecting over 10% of the population and causing significant mortality. While kidney biopsy remains the gold standard for CKD diagnosis and treatment, the lack of comprehensive benchmarks for kidney pathology segmentation hinders progress in the field. To address this, we organized the Kidney Pathology Image Segmentation (KPIs) Challenge, introducing a dataset that incorporates preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes two tasks, patch-level segmentation and whole slide image segmentation and detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. By encouraging innovative segmentation methods that adapt to diverse CKD models and tissue conditions, the KPIs Challenge aims to advance kidney pathology analysis, establish new benchmarks, and enable precise, large-scale quantification for disease research and diagnosis.
Submitted 11 February, 2025;
originally announced February 2025.
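The two metrics named in the abstract above have standard textbook definitions, sketched below. This is a minimal illustration, not the challenge's official evaluation code: the function names, the flat 0/1 mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made for the example.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient, 2|A ∩ B| / (|A| + |B|), for two binary
    masks given as equal-length flat sequences of 0/1 values (assumed layout)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


def f1_score(true_pos, false_pos, false_neg):
    """F1-score from detection counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * true_pos + false_pos + false_neg
    if denom == 0:
        # Nothing to detect and nothing detected (assumed convention).
        return 1.0
    return 2.0 * true_pos / denom


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
    print(f1_score(true_pos=8, false_pos=2, false_neg=2))
```

For segmentation, Dice computed on pixels and F1 computed on pixel-level counts coincide; the two are listed separately here because the abstract pairs DSC with segmentation and F1 with detection.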
-
Leveraging InfiniBand Controller to Configure Deadlock-Free Routing Engines for Dragonflies
Authors:
German Maglione-Mathey,
Jesus Escudero-Sahuquillo,
Pedro Javier Garcia,
Francisco J. Quiles,
Eitan Zahavi
Abstract:
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
Submitted 3 February, 2025;
originally announced February 2025.
-
Congestion Management in High-Performance Interconnection Networks Using Adaptive Routing Notifications
Authors:
Jose Rocher-Gonzalez,
Jesus Escudero-Sahuquillo,
Pedro J. Garcia,
Francisco J. Quiles
Abstract:
The interconnection network is a crucial subsystem in High-Performance Computing clusters and Data-centers, guaranteeing high bandwidth and low latency to the applications' communication operations. Unfortunately, congestion situations may spoil network performance unless the network design applies specific countermeasures. Adaptive routing algorithms are a traditional approach to dealing with congestion since they provide traffic flows with alternative routes that bypass congested areas. However, adaptive routing decisions at switches are typically based on local information without a global network traffic perspective, leading to congestion spreading throughout the network beyond the original congested areas. In this paper, we propose a new efficient congestion management strategy that leverages adaptive routing notifications currently available in some interconnect technologies and efficiently isolates the congesting flows in reserved spaces at switch buffers. The experiment results based on simulations of realistic traffic scenarios show that our proposal removes the congestion impact.
Submitted 1 February, 2025;
originally announced February 2025.
-
Towards an Efficient Combination of Adaptive Routing and Queuing Schemes in Fat-Tree Topologies
Authors:
Jose Rocher-Gonzalez,
Jesus Escudero-Sahuquillo,
Pedro J. Garcia,
Francisco J. Quiles,
Gaspar Mora
Abstract:
The interconnection network is a key element in High-Performance Computing (HPC) and Datacenter (DC) systems whose performance depends on several design parameters, such as the topology, the switch architecture, and the routing algorithm. Among the most common topologies in HPC systems, the Fat-Tree offers several shortest-path routes between any pair of end-nodes, which allows multi-path routing schemes to balance traffic flows among the available links, thus reducing congestion probability. However, traffic balance cannot by itself resolve some congestion situations that may still degrade network performance. Another approach to reduce congestion is queue-based flow separation, but our previous work shows that multi-path routing may spread congested flows across several queues, thus being counterproductive. In this paper, we propose a set of restrictions to improve the selection of alternative routes for multi-path routing algorithms in Fat-Tree networks, so that they can be positively combined with queuing schemes.
Submitted 1 February, 2025;
originally announced February 2025.
-
A* Based Algorithm for Reduced Complexity ML Decoding of Tailbiting Codes
Authors:
Jorge Ortin,
Paloma Garcia,
Fernando Gutierrez,
Antonio Valdovinos
Abstract:
The A* algorithm is a graph search algorithm which has shown good results in terms of computational complexity for Maximum Likelihood (ML) decoding of tailbiting convolutional codes. The decoding of tailbiting codes with this algorithm is performed in two phases. In the first phase, a typical Viterbi decoding is employed to collect information regarding the trellis. The A* algorithm is then applied in the second phase, using the information obtained in the first one to calculate the heuristic function. The improvements proposed in this work decrease the computational complexity of the A* algorithm using further information from the first phase of the algorithm. This information is used for obtaining a more accurate heuristic function and finding early terminating conditions for the A* algorithm. Simulation results show that the proposed modifications decrease the complexity of ML decoding with the A* algorithm in terms of the performed number of operations.
Submitted 25 January, 2025;
originally announced January 2025.
-
Channel Independent Precoder for OFDM-based Systems over Fading Channels
Authors:
Jorge Ortin,
Paloma Garcia,
Fernando Gutierrez,
Antonio Valdovinos
Abstract:
In this paper we propose a channel-independent precoder for orthogonal frequency division multiplexing (OFDM) systems over fading channels. The design of the precoder is based on the information redistribution of the input modulated symbols amongst the output precoded symbols. The proposed precoder decreases the variance of the instantaneous noise power at the receiver produced by the channel variability. The employment of an interleaver together with a precoding matrix whose size does not depend on the number of data carriers in an OFDM symbol allows different configurations of time-frequency diversity which can be easily adapted to the channel conditions. The precoder is evaluated with a modified Zero Forcing (ZF) equalizer whose maximum gain is constrained by means of a clipping factor. Thus, the clipping factor limits the noise power transfer in the receiver deprecoding block in low SNR conditions.
Submitted 24 January, 2025;
originally announced January 2025.
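The gain-limited Zero Forcing equalizer mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say how the clipping is applied, so this example assumes one plausible reading: the per-carrier ZF gain 1/h keeps its phase, and its magnitude is capped at a clipping factor. The function name `clipped_zf_gain` and the zero-gain handling of a null carrier are hypothetical choices for illustration only.

```python
def clipped_zf_gain(h, clip):
    """Per-carrier equalizer gain for channel coefficient h (complex).

    Plain ZF would use g = 1/h, which blows up on deeply faded carriers.
    Here the magnitude of g is capped at `clip` while its phase is kept
    (an assumed form of the clipping, not taken from the paper).
    """
    if h == 0:
        # A fully nulled carrier carries no usable signal; return zero gain
        # (assumed convention for this sketch).
        return 0j
    gain = 1 / h
    magnitude = abs(gain)
    if magnitude <= clip:
        return gain
    return gain * (clip / magnitude)


if __name__ == "__main__":
    # A strong carrier is equalized exactly; a faded one is limited to |g| = 4.
    for h in (2 + 0j, 0.1 + 0j, 0.1j):
        g = clipped_zf_gain(h, clip=4.0)
        print(h, g, abs(g))
```

Capping the gain trades a small residual distortion on faded carriers for bounded noise amplification, which matches the abstract's remark that the clipping factor limits noise power transfer at low SNR.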
-
Performance Analysis of Turbo Decoding Algorithms in Wireless OFDM Systems
Authors:
Jorge Ortin,
Paloma Garcia,
Fernando Gutierrez,
Antonio Valdovinos
Abstract:
Turbo codes are well known to be among the error correction techniques that come closest to the Shannon limit. Nevertheless, the specific performance of the code depends heavily on the particular decoding algorithm used at the receiver. In this sense, the choice of decoding algorithm involves a trade-off between the gain introduced by the code and the complexity of the decoding process. In this work we perform a thorough analysis of the different iterative decoding techniques and analyze their suitability for being implemented in the user terminals of new cellular and broadcast systems which are based on orthogonal frequency division multiplexing (OFDM). The analyzed iterative decoding algorithms are the Max-Log-MAP and the soft output Viterbi algorithm (SOVA), since both of them have a relatively low computational complexity, simplifying their implementation in cost-efficient terminals. Simulation results have been obtained for different encoder structures and block sizes, considering realistic channel conditions (an OFDM transmission over a wireless channel).
Submitted 23 January, 2025;
originally announced January 2025.
-
Two Step SOVA-Based Decoding Algorithm for Tailbiting Codes
Authors:
Jorge Ortin,
Paloma Garcia,
Fernando Gutierrez,
Antonio Valdovinos
Abstract:
In this work we propose a novel decoding algorithm for tailbiting convolutional codes and evaluate its performance over different channels. The proposed method consists of a fixed two-step Viterbi decoding of the received data. In the first step, an estimation of the most likely state is performed based on a SOVA decoding. The second step consists of a conventional Viterbi decoding that employs the state estimated in the previous step as the initial and final states of the trellis. Simulation results show a performance close to that of maximum-likelihood decoding.
Submitted 23 January, 2025;
originally announced January 2025.
-
ORCAst: Operational High-Resolution Current Forecasts
Authors:
Pierre Garcia,
Inès Larroche,
Amélie Pesnec,
Hannah Bull,
Théo Archambault,
Evangelos Moschos,
Alexandre Stegner,
Anastase Charantonis,
Dominique Béréziat
Abstract:
We present ORCAst, a multi-stage, multi-arm network for Operational high-Resolution Current forecAsts over one week. Producing real-time nowcasts and forecasts of ocean surface currents is a challenging problem due to indirect or incomplete information from satellite remote sensing data. Entirely trained on real satellite data and in situ measurements from drifters, our model learns to forecast global ocean surface currents using various sources of ground truth observations in a multi-stage learning procedure. Our multi-arm encoder-decoder model architecture allows us to first predict sea surface height and geostrophic currents from larger quantities of nadir and SWOT altimetry data, before learning to predict ocean surface currents from much more sparse in situ measurements from drifters. Training our model on specific regions improves performance. Our model achieves stronger nowcast and forecast performance in predicting ocean surface currents than various state-of-the-art methods.
Submitted 21 January, 2025;
originally announced January 2025.
-
Real-Time Trajectory Generation for Soft Robot Manipulators Using Differential Flatness
Authors:
Akua Dickson,
Juan C. Pacheco Garcia,
Ran Jing,
Meredith L. Anderson,
Andrew P. Sabelhaus
Abstract:
Soft robots have the potential to interact with sensitive environments and perform complex tasks effectively. However, motion plans and trajectories for soft manipulators are challenging to calculate due to their deformable nature and nonlinear dynamics. This article introduces a fast real-time trajectory generation approach for soft robot manipulators, which creates dynamically-feasible motions for arbitrary kinematically-feasible paths of the robot's end effector. Our insight is that piecewise constant curvature (PCC) dynamics models of soft robots can be differentially flat, so control inputs can be calculated algebraically rather than through a nonlinear differential equation. We prove this flatness under certain conditions, with the curvatures of the robot as the flat outputs. Our two-step trajectory generation approach uses an inverse kinematics procedure to calculate a motion plan of robot curvatures per end-effector position; our flatness diffeomorphism then generates corresponding control inputs that respect velocity. We validate our approach through simulations of our representative soft robot manipulator along three different trajectories, demonstrating a margin of 23x faster than real-time at a frequency of 100 Hz. This approach could allow fast verifiable replanning of soft robots' motions in safety-critical physical environments, crucial for deployment in the real world.
Submitted 11 December, 2024;
originally announced December 2024.
-
SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model
Authors:
Xinyuan Qian,
Jiaran Gao,
Yaodan Zhang,
Qiquan Zhang,
Hexin Liu,
Leibny Paola Garcia,
Haizhou Li
Abstract:
Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e., SAV-SE. To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S$^2$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E over other competitive methods. We will make the source code publicly available. Project demo page: https://AVSEPage.github.io/
Submitted 2 April, 2025; v1 submitted 12 November, 2024;
originally announced November 2024.
-
Methane projections from Canada's oil sands tailings using scientific deep learning reveal significant underestimation
Authors:
Esha Saha,
Oscar Wang,
Amit K. Chakraborty,
Pablo Venegas Garcia,
Russell Milne,
Hao Wang
Abstract:
Bitumen extraction for the production of synthetic crude oil in Canada's Athabasca Oil Sands industry has recently come under the spotlight for being a significant source of greenhouse gas emissions. A major cause of concern is methane, a greenhouse gas produced by the anaerobic biodegradation of hydrocarbons in oil sands residues, or tailings, stored in settling basins commonly known as oil sands tailings ponds. In order to determine the methane-emitting potential of these tailings ponds and obtain future methane projections, we use real-time weather data, mechanistic models developed from laboratory-controlled experiments, and industrial reports to train a physics-constrained machine learning model. Our trained model can successfully identify the directions of active ponds and estimate their emission levels, which are generally hard to obtain due to data sampling restrictions. We found that each active oil sands tailings pond could emit between 950 and 1500 tonnes of methane per year, whose environmental impact is equivalent to carbon dioxide emissions from at least 6000 gasoline-powered vehicles. Although abandoned ponds are often presumed to have insignificant emissions, our findings indicate that these ponds could become active over time and potentially emit up to 1000 tonnes of methane each year. Taking an average over all datasets that were used in model training, we estimate that emissions around major oil sands regions would need to be reduced by approximately 12% over a year to reduce the average methane concentrations to 2005 levels.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
HoloSpot: Intuitive Object Manipulation via Mixed Reality Drag-and-Drop
Authors:
Pablo Soler Garcia,
Petar Lukovic,
Lucie Reynaud,
Andrea Sgobbi,
Federica Bruni,
Martin Brun,
Marc Zünd,
Riccardo Bollati,
Marc Pollefeys,
Hermann Blum,
Zuria Bauer
Abstract:
Human-robot interaction through mixed reality (MR) technologies enables novel, intuitive interfaces to control robots in remote operations. Such interfaces facilitate operations in hazardous environments, where human presence is risky, yet human oversight remains crucial. Potential environments include disaster response scenarios and areas with high radiation or toxic chemicals. In this paper we present an interface system projecting a 3D representation of a scanned room as a scaled-down 'dollhouse' hologram, allowing users to select and manipulate objects using a straightforward drag-and-drop interface. We then translate these drag-and-drop user commands into real-time robot actions based on the recent Spot-Compose framework. The Unity-based application provides an interactive tutorial and a user-friendly experience, ensuring ease of use. Through comprehensive end-to-end testing, we validate the system's capability in executing pick-and-place tasks and a complementary user study affirms the interface's intuitive controls. Our findings highlight the advantages of this interface in improving user experience and operational efficiency. This work lays the groundwork for a robust framework that advances the potential for seamless human-robot collaboration in diverse applications. Paper website: https://holospot.github.io/
Submitted 14 October, 2024;
originally announced October 2024.
-
Self-Sensing for Proprioception and Contact Detection in Soft Robots Using Shape Memory Alloy Artificial Muscles
Authors:
Ran Jing,
Meredith L. Anderson,
Juan C. Pacheco Garcia,
Andrew P. Sabelhaus
Abstract:
Estimating a soft robot's pose and applied forces, also called proprioception, is crucial for safe interaction of the robot with its environment. However, most solutions for soft robot proprioception use dedicated sensors, particularly for external forces, which introduce design trade-offs, rigidity, and risk of failure. This work presents an approach for pose estimation and contact detection for soft robots actuated by shape memory alloy (SMA) artificial muscles, using no dedicated force sensors. Our framework uses the unique material properties of SMAs to self-sense their internal stress, via offboard measurements of their electrical resistance and in-situ temperature readings, in an existing fully-soft limb design. We demonstrate that a simple polynomial regression model on these measurements is sufficient to predict the robot's pose under no-contact conditions. Then, we show that if an additional measurement of the true pose is available (e.g. from an already-in-place bending sensor), it is possible to predict a binary contact/no-contact state using multiple combinations of self-sensing signals. Our hardware tests verify our hypothesis via a contact detection test with a human operator. This proof-of-concept validates that self-sensing signals in SMA-actuated soft robots can be used for proprioception and contact detection, and suggests a direction for integrating proprioception into soft robots without design compromises. Future work could employ machine learning for enhanced accuracy.
Submitted 25 September, 2024;
originally announced September 2024.
-
The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization
Authors:
Samuele Cornell,
Taejin Park,
Steve Huang,
Christoph Boeddeker,
Xuankai Chang,
Matthew Maciejewski,
Matthew Wiesner,
Paola Garcia,
Shinji Watanabe
Abstract:
This paper presents the CHiME-8 DASR challenge, which carries on from the previous edition, CHiME-7 DASR (C7DASR), and the past CHiME-6 challenge. It focuses on joint multi-channel distant speech recognition (DASR) and diarization with one or more, possibly heterogeneous, devices. The main goal is to spur research towards meeting transcription approaches that can generalize across an arbitrary number of speakers, diverse settings (formal vs. informal conversations), meeting durations, a wide variety of acoustic scenarios, and different recording configurations. Novelties with respect to C7DASR include: i) the addition of NOTSOFAR-1, an additional office/corporate meeting scenario, ii) a manually corrected Mixer 6 development set, iii) a new track in which we allow the use of large language models (LLMs), and iv) a jury award mechanism to encourage participants to also explore more practical and innovative solutions. To lower the entry barrier for participants, we provide a standalone toolkit for downloading and preparing these datasets as well as performing text normalization and scoring submissions. Furthermore, this year we also provide two baseline systems, one directly inherited from C7DASR and based on ESPnet, and another developed on NeMo and based on the NeMo team's submission to last year's C7DASR. Baseline system results suggest that the addition of the NOTSOFAR-1 scenario significantly increases the task's difficulty due to its high number of speakers and very short duration.
Submitted 23 July, 2024;
originally announced July 2024.
-
Cyber Physical Games
Authors:
Warisa Sritriratanarak,
Paulo Garcia
Abstract:
We describe a formulation of multiple agents operating within a Cyber-Physical System, resulting in collaborative or adversarial games. We show that the non-determinism inherent in the communication medium between agents and the underlying physical environment gives rise to environment evolution that is a probabilistic function of agents' strategies. We name these emergent properties Cyber Physical Games and study their properties. We present an algorithmic model that determines the most likely system evolution, approximating Cyber Physical Games through Probabilistic Finite State Automata, and evaluate it on collaborative and adversarial versions of the Iterated Boolean Game, comparing theoretical results with simulated ones. Results support the validity of the proposed model, and suggest several research directions required to continue evolving our understanding of Cyber-Physical Systems, as well as how best to design agents that must operate within such environments.
Submitted 8 July, 2024;
originally announced July 2024.
-
An Open and Reconfigurable User Interface to Manage Complex ROS-based Robotic Systems
Authors:
Pablo Malvido Fresnillo,
Saigopal Vasudevan,
Jose A. Perez Garcia,
Jose L. Martinez Lastra
Abstract:
The Robot Operating System (ROS) has significantly gained popularity among robotic engineers and researchers over the past five years, primarily due to its powerful infrastructure for node communication, which enables developers to build modular and large robotic applications. However, ROS presents a steep learning curve and lacks the intuitive usability of vendor-specific robotic Graphical User Interfaces (GUIs). Moreover, its modular and distributed nature complicates the control and monitoring of extensive systems, even for advanced users. To address these challenges, this paper proposes a highly adaptable and reconfigurable web-based GUI for intuitively controlling, monitoring, and configuring complex ROS-based robotic systems. The GUI leverages ROSBridge and roslibjs to ensure seamless communication with ROS systems via topics and services. Designed as a versatile platform, the GUI allows for the selective incorporation of modular features to accommodate diverse robotic systems and applications. An initial set of commonly used features in robotic applications is presented. To demonstrate its reconfigurability, the GUI was customized and tested for four industrial use cases, receiving positive feedback. The project's repository has been made publicly available to support the robotics community and lower the entry barrier for ROS in industrial applications.
Submitted 4 June, 2024;
originally announced June 2024.
-
Perfect codes over non-prime power alphabets: an approach based on Diophantine equations
Authors:
Pedro-José Cazorla García
Abstract:
Perfect error correcting codes allow for an optimal transmission of information while guaranteeing error correction. For this reason, proving their existence has been a classical problem in both pure mathematics and information theory. Indeed, the classification of the parameters of $e$-error correcting perfect codes over $q$-ary alphabets was a very active topic of research in the late 20th century. Consequently, all parameters of perfect $e$-error correcting codes were found if $e \ge 3$, and it was conjectured that no perfect $2$-error correcting codes exist over any $q$-ary alphabet, where $q > 3$. In the 1970s, this was proved for $q$ a prime power, for $q = 2^r3^s$ and for only $7$ other values of $q$. Almost $50$ years later, it is surprising to note that there have been no new results in this regard and the classification of $2$-error correcting codes over non-prime power alphabets remains an open problem. In this paper, we use techniques from the resolution of the generalised Ramanujan--Nagell equation and from modern computational number theory to show that perfect $2$-error correcting codes do not exist for $172$ new values of $q$ which are not prime powers, substantially increasing the number of values of $q$ which are now classified. In addition, we prove that, for any fixed value of $q$, there can be at most finitely many perfect $2$-error correcting codes over an alphabet of size $q$.
Submitted 24 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
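For context, the counting condition that underlies classifications of this kind is the sphere-packing (Hamming) bound, which perfect codes meet with equality. The notation below ($n$ for the block length, $M$ for the number of codewords) is assumed for illustration and does not come from the abstract.

```latex
% A q-ary code of length n with M codewords that corrects e errors is perfect
% exactly when the radius-e Hamming balls around the codewords tile the space:
\[
  M \sum_{i=0}^{e} \binom{n}{i} (q-1)^{i} = q^{n} .
\]
% For e = 2 the ball size is
\[
  1 + n(q-1) + \binom{n}{2}(q-1)^{2} ,
\]
% which must therefore divide q^n.
```

When $q$ is not a prime power, the ball size can only involve the primes dividing $q$, which is presumably where Diophantine equations enter; the abstract does not state the exact equations used.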
-
Runtime Verification and Field-based Testing for ROS-based Robotic Systems
Authors:
Ricardo Caldas,
Juan Antonio Pinera Garcia,
Matei Schiopu,
Patrizio Pelliccione,
Genaina Rodrigues,
Thorsten Berger
Abstract:
Robotic systems are becoming pervasive and adopted in increasingly many domains, such as manufacturing, healthcare, and space exploration. To this end, engineering software has emerged as a crucial discipline for building maintainable and reusable robotic systems. The robotics software engineering research field has received increasing attention, fostering autonomy as a fundamental goal. However, robotics developers are still challenged to achieve this goal because simulation cannot realistically deliver solutions to emulate real-world phenomena. Robots also need to operate in unpredictable and uncontrollable environments, which require safe and trustworthy self-adaptation capabilities implemented in software. Typical techniques to address the challenges are runtime verification, field-based testing, and mitigation techniques that enable fail-safe solutions. However, no clear guidance exists for architecting ROS-based systems to enable and facilitate runtime verification and field-based testing. This paper aims to fill this gap by providing guidelines to help developers and quality assurance (QA) teams develop, verify, or test their robots in the field. These guidelines are carefully tailored to address the challenges and requirements of testing robotics systems in real-world scenarios. We (i) conducted a literature review on studies addressing runtime verification and field-based testing for robotic systems, (ii) mined ROS-based application repositories, and (iii) validated the applicability, clarity, and usefulness of the guidelines via two questionnaires with 55 answers overall. We contribute 20 guidelines, 8 for developers and 12 for QA teams, formulated for researchers and practitioners in robotic software engineering. Finally, we map our guidelines to open challenges in runtime verification and field-based testing for ROS-based systems, and we outline promising research directions in the field.
Submitted 21 August, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Enhancing Students' Learning Process Through Self-Generated Tests
Authors:
Marcos Sánchez-Élez,
Inmaculada Pardines,
Pablo García,
Guadalupe Miñana,
Sara Román,
Margarita Sánchez,
José L. Risco-Martín
Abstract:
The use of new technologies in higher education has surprisingly emphasized students' tendency to adopt passive behavior in class. Participation and interaction of students are essential to improve academic results. This paper describes an educational experiment aimed at promoting students' autonomous learning by requiring them to generate test-type questions related to the contents of the course. The main idea is to make students feel part of the evaluation process by including their questions in the evaluation exams. A set of applications running on our university's online learning environment has been developed to provide both students and teachers with the necessary tools for good interaction between them. Questions uploaded by students are visible to every enrolled student as well as to each involved teacher. In this way, students enhance their critical analysis skills by solving, and finding possible mistakes in, the questions submitted by their peers. The experiment involved 769 students from 12 different courses. Results show that the students who actively participated in the experiment obtained better academic performance.
Submitted 21 March, 2024;
originally announced March 2024.
-
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Authors:
Xiangyu Zhang,
Daijiao Liu,
Hexin Liu,
Qiquan Zhang,
Hanyu Meng,
Leibny Paola Garcia,
Eng Siong Chng,
Lina Yao
Abstract:
Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performance across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches to accelerate training, a key factor in the costs associated with adding or customizing voices, often necessitate complex modifications to the model, compromising their universal applicability. To address these challenges, we ask: is it possible to enhance the training/inference speed and performance of DDPMs by modifying the speech signal itself? In this paper, we double the training and inference speed of Speech DDPMs by simply redirecting the generative target to the wavelet domain. This method not only achieves comparable or superior performance to the original model in speech synthesis tasks but also demonstrates its versatility. By investigating and utilizing different wavelet bases, our approach proves effective not just in speech synthesis, but also in speech enhancement.
Submitted 23 September, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Agents Need Not Know Their Purpose
Authors:
Paulo Garcia
Abstract:
Ensuring artificial intelligence behaves in a way that is aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, behaving in a way that maximizes a utility function, will inevitably behave in a way that is not aligned with human values, especially as their level of intelligence goes up. Prior work has also shown that there is no "one true utility function"; solutions must include a more holistic approach to alignment. This paper describes oblivious agents: agents that are architected in such a way that their effective utility function is an aggregation of known and hidden sub-functions. The hidden component, to be maximized, is internally implemented as a black box, preventing the agent from examining it. The known component, to be minimized, is knowledge of the hidden sub-function. Architectural constraints further influence how agent actions can evolve its internal environment model. We show that an oblivious agent, behaving rationally, constructs an internal approximation of designers' intentions (i.e., infers alignment), and, as a consequence of its architecture and effective utility function, behaves in a way that maximizes alignment, i.e., maximizes the approximated intention function. We show that, paradoxically, it does this for whatever utility function is used as the hidden component and, in contrast with extant techniques, chances of alignment actually improve as agent intelligence grows.
Submitted 15 February, 2024;
originally announced February 2024.
-
Maximizing Consistent Force Output for Shape Memory Alloy Artificial Muscles in Soft Robots
Authors:
Meredith L. Anderson,
Ran Jing,
Juan C. Pacheco Garcia,
Ilyoung Yang,
Sarah Alizadeh-Shabdiz,
Charles DeLorey,
Andrew P. Sabelhaus
Abstract:
Soft robots have immense potential given their inherent safety and adaptability, but challenges in soft actuator forces and design constraints have limited scaling up soft robots to larger sizes. Electrothermal shape memory alloy (SMA) artificial muscles have the potential to create these large forces and high displacements, but consistently using these muscles under a well-defined model, in-situ in a soft robot, remains an open challenge. This article provides a system for maintaining the highest-possible consistent SMA forces, over long lifetimes, by combining a fatigue testing protocol with a supervisory control system for the muscles' internal temperature state. We propose a design of a soft limb with swap-able SMA muscles, and deploy the limb in a blocked-force test to quantify the relationship between the measured maximum force at different temperatures over different lifetimes. Then, by applying an invariance-based control system to maintain temperatures under our long-life limit, we demonstrate consistent high forces in a practical task over hundreds of cycles. The method we developed allows for practical implementation of SMAs in soft robots through characterizing and controlling their behavior in-situ, and provides a method to impose limits that maximize their consistent, repeatable behavior.
Submitted 9 February, 2024;
originally announced February 2024.
-
Preserving Power Optimizations Across the High Level Synthesis of Distinct Application-Specific Circuits
Authors:
Paulo Garcia
Abstract:
We evaluate the use of software interpretation to push High Level Synthesis of application-specific accelerators toward a higher level of abstraction. Our methodology is supported by a formal power consumption model that computes the power consumption of accelerator components, accurately predicting the power consumption on new designs from prior optimization estimations. We demonstrate how our approach simplifies the re-use of power optimizations across distinct designs, by leveraging the higher level of design abstraction, using two accelerators representative of the robotics domain, implemented through the Bambu High Level Synthesis tool. Results support the research hypothesis, achieving predictions accurate within +/- 1%.
Submitted 9 July, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Accelerating Boolean Constraint Propagation for Efficient SAT-Solving on FPGAs
Authors:
Hariprasadh Govindasamy,
Babak Esfandiari,
Paulo Garcia
Abstract:
We present a hardware-accelerated SAT solver targeting processor/Field Programmable Gate Array (FPGA) SoCs. Our solution accelerates the most expensive subroutine of the Davis-Putnam-Logemann-Loveland (DPLL) algorithm, Boolean Constraint Propagation (BCP), through fine-grained FPGA parallelism. Unlike prior state-of-the-art solutions, our solver eliminates costly clause look-up operations by assigning clauses directly to clause processors on the FPGA and dividing large formulas into smaller partitions manageable by the FPGA. Partitions are hot-swapped during runtime as required, and the supported formula size is limited only by available external memory, not on-chip FPGA memory. We evaluate our solver on a Xilinx Zynq platform, with results showing quicker execution time across various formula sizes, subject to formula partitioning strategy. Compared to prior state-of-the-art, we achieve 1.7x and 1.1x speedup on BCP for 2 representative benchmarks and up to 6x total speedup over a software-only implementation.
Submitted 13 April, 2024; v1 submitted 14 January, 2024;
originally announced January 2024.
-
FPGAs (Can Get Some) SATisfaction
Authors:
Hariprasadh Govindasamy,
Babak Esfandiari,
Paulo Garcia
Abstract:
We present a hardware-accelerated SAT solver suitable for processor/Field Programmable Gate Arrays (FPGA) hybrid platforms, which have become the norm in the embedded domain. Our solution addresses a known bottleneck in SAT solving acceleration: unlike prior state-of-the-art solutions that have addressed the same bottleneck by limiting the amount of exploited parallelism, our solver takes advantage of fine-grained parallelization opportunities by hot-swapping FPGA clause assignments at runtime. It is also the first modern completely open-source SAT accelerator, and formula size is limited only by the amount of available external memory, not by on-chip FPGA memory. Evaluation is performed on a Xilinx Zynq platform: experiments support that hardware acceleration results in shorter execution time across varying formula sizes, subject to formula partitioning strategy. We outperform prior state-of-the-art by 1.7x and 1.1x, respectively, for 2 representative benchmarks, and boast up to 6x performance increase over software-only implementation.
Submitted 18 December, 2023;
originally announced December 2023.
-
On a Functional Definition of Intelligence
Authors:
Warisa Sritriratanarak,
Paulo Garcia
Abstract:
Without an agreed-upon definition of intelligence, asking "is this system intelligent?" is an untestable question. This lack of consensus hinders research on, and public perception of, Artificial Intelligence (AI), particularly since the rise of generative and large language models. Most work on precisely capturing what we mean by "intelligence" has come from the fields of philosophy, psychology, and cognitive science. Because these perspectives are intrinsically linked to intelligence as it is demonstrated by natural creatures, we argue such fields cannot, and will not, provide a sufficiently rigorous definition that can be applied to artificial means. Thus, we present an argument for a purely functional, black-box definition of intelligence, distinct from how that intelligence is actually achieved; focusing on the "what", rather than the "how". To achieve this, we first distinguish other related concepts (sentience, sensation, agency, etc.) from the notion of intelligence, particularly identifying how these concepts pertain to artificial intelligent systems. As a result, we achieve a formal definition of intelligence that is conceptually testable from only external observation and that suggests intelligence is a continuous variable. We conclude by identifying challenges that still remain towards quantifiable measurement. This work provides a useful perspective both for the development of AI and for public perception of the capabilities and risks of AI.
Submitted 15 December, 2023;
originally announced December 2023.
-
PhyOT: Physics-informed object tracking in surveillance cameras
Authors:
Kawisorn Kamtue,
Jose M. F. Moura,
Orathai Sangpetch,
Paulo Garcia
Abstract:
While deep learning has been very successful in computer vision, real-world operating conditions such as lighting variation, background clutter, or occlusion hinder its accuracy across several tasks. Prior work has shown that hybrid models -- combining neural networks and heuristics/algorithms -- can outperform vanilla deep learning for several computer vision tasks, such as classification or tracking. We consider the case of object tracking, and evaluate a hybrid model (PhyOT) that conceptualizes deep neural networks as "sensors" in a Kalman filter setup, where prior knowledge, in the form of Newtonian laws of motion, is used to fuse sensor observations and to perform improved estimations. Our experiments combine three neural networks, performing position, indirect velocity and acceleration estimation, respectively, and evaluate such a formulation on two benchmark datasets: a warehouse security camera dataset that we collected and annotated, and an open traffic camera dataset. Results suggest that PhyOT can track objects in extreme conditions in which state-of-the-art deep neural networks fail, while its performance in general cases does not degrade significantly from that of existing deep learning approaches. Results also suggest that the PhyOT components are generalizable and transferable.
Submitted 13 December, 2023;
originally announced December 2023.
-
A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors
Authors:
Shuyue Stella Li,
Beining Xu,
Xiangyu Zhang,
Hexin Liu,
Wenhan Chao,
Leibny Paola Garcia
Abstract:
In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor for a set of typologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations using deep generalized canonical correlation analysis. Results show the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. There is a positive correlation between PSR scores and ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.
Submitted 27 November, 2023;
originally announced November 2023.
-
Software Testing and Code Refactoring: A Survey with Practitioners
Authors:
Danilo Leandro Lima,
Ronnie de Souza Santos,
Guilherme Pires Garcia,
Sildemir S. da Silva,
Cesar Franca,
Luiz Fernando Capretz
Abstract:
Nowadays, software testing professionals are commonly required to develop coding skills to work on test automation. One essential skill required from those who code is the ability to implement code refactoring, a valued quality aspect of software development; however, software developers usually encounter obstacles in successfully applying this practice. In this scenario, the present study aims to explore how software testing professionals (e.g., software testers, test engineers, test analysts, and software QAs) deal with code refactoring to understand the benefits and limitations of this practice in the context of software testing. We followed established guidelines for conducting surveys in software engineering and applied three sampling techniques, namely convenience sampling, purposive sampling, and snowball sampling, to collect data from testing professionals. We received answers from 80 individuals reporting their experience refactoring the code of automated tests. We concluded that in the context of software testing, refactoring offers several benefits, such as supporting the maintenance of automated tests and improving the performance of the testing team. However, practitioners might encounter barriers in effectively implementing this practice, in particular, the lack of interest from managers and leaders. Our study raises discussions on the importance of having testing professionals implement refactoring in the code of automated tests, allowing them to improve their coding abilities.
Submitted 2 October, 2023;
originally announced October 2023.
-
Enhancing Code-switching Speech Recognition with Interactive Language Biases
Authors:
Hexin Liu,
Leibny Paola Garcia,
Xiangyu Zhang,
Andy W. H. Khong,
Sanjeev Khudanpur
Abstract:
Languages usually switch within a multilingual speech signal, especially in a bilingual society. This phenomenon, referred to as code-switching (CS), makes automatic speech recognition (ASR) challenging under a multilingual scenario. We propose to improve CS-ASR by biasing the hybrid CTC/attention ASR model with multi-level language information comprising frame- and token-level language posteriors. The interaction between various resolutions of language biases is subsequently explored in this work. We conducted experiments on datasets from the ASRU 2019 code-switching challenge. Compared to the baseline, the proposed interactive language biases (ILB) method achieves higher performance, and ablation studies highlight the effects of different language biases and their interactions. In addition, the results presented indicate that language bias implicitly enhances internal language modeling, leading to performance degradation after employing an external language model.
Submitted 28 September, 2023;
originally announced September 2023.
-
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
Authors:
Ruixing Liang,
Xiangyu Zhang,
Qiong Li,
Lai Wei,
Hexin Liu,
Avisha Kumar,
Kelley M. Kempski Leadingham,
Joshua Punnoose,
Leibny Paola Garcia,
Amir Manbachi
Abstract:
While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.
Submitted 26 September, 2023;
originally announced September 2023.
-
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
Authors:
Samuele Cornell,
Matthew Wiesner,
Shinji Watanabe,
Desh Raj,
Xuankai Chang,
Paola Garcia,
Matthew Maciejewski,
Yoshiki Masuyama,
Zhong-Qiu Wang,
Stefano Squartini,
Sanjeev Khudanpur
Abstract:
The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Unlike previous challenges, we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The goal is for participants to devise a single system that can generalize across different array geometries and use cases with no a priori information. Another departure from earlier CHiME iterations is that participants are allowed to use open-source pre-trained models and datasets. In this paper, we describe the challenge design, motivation, and fundamental research questions in detail. We also present the baseline system, which is fully array-topology agnostic and features multi-channel diarization, channel selection, guided source separation and a robust ASR model that leverages self-supervised speech representations (SSLR).
Submitted 14 July, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
Authors:
Dongji Gao,
Matthew Wiesner,
Hainan Xu,
Leibny Paola Garcia,
Daniel Povey,
Sanjeev Khudanpur
Abstract:
This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performance of ASR models. To address this problem, we propose Bypass Temporal Classification (BTC) as an expansion of the Connectionist Temporal Classification (CTC) criterion. BTC explicitly encodes the uncertainties associated with transcripts during training. This is accomplished by enhancing the flexibility of the training graph, which is implemented as a weighted finite-state transducer (WFST) composition. The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora. Our implementation will be open-sourced.
Submitted 1 June, 2023;
originally announced June 2023.
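
For readers unfamiliar with the CTC criterion that BTC extends, the sketch below shows only the standard CTC collapsing rule (merge repeated labels, then drop blanks) that maps a frame-level label sequence to a transcript. It does not implement BTC's handling of uncertain transcripts or any WFST composition; the function name and the "-" blank symbol are assumptions made for illustration.

```python
def ctc_collapse(frame_labels, blank="-"):
    """Collapse a frame-level CTC path: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


if __name__ == "__main__":
    # The blank between the two "l" runs keeps the double letter.
    print("".join(ctc_collapse(list("hh-e-ll-lo"))))
```

Many different frame-level paths collapse to the same transcript, which is why CTC training sums over all of them; a training graph with more permissive arcs, as the abstract describes, enlarges that set of accepted paths.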
-
Learning efficient backprojections across cortical hierarchies in real time
Authors:
Kevin Max,
Laura Kriener,
Garibaldi Pineda García,
Thomas Nowotny,
Ismael Jaras,
Walter Senn,
Mihai A. Petrovici
Abstract:
Models of sensory processing and learning in the cortex need to efficiently assign credit to synapses in all areas. In deep learning, a known solution is error backpropagation, which, however, requires biologically implausible weight transport from feed-forward to feedback paths.
We introduce Phaseless Alignment Learning (PAL), a bio-plausible method to learn efficient feedback weights in layered cortical hierarchies. This is achieved by exploiting the noise naturally found in biophysical systems as an additional carrier of information. In our dynamical system, all weights are learned simultaneously with always-on plasticity and using only information locally available to the synapses. Our method is completely phase-free (no forward and backward passes or phased learning) and allows for efficient error propagation across multi-layer cortical hierarchies, while maintaining biologically plausible signal transport and learning.
Our method is applicable to a wide class of models and improves on previously known biologically plausible ways of credit assignment: compared to random synaptic feedback, it can solve complex tasks with fewer neurons and learn more useful latent representations. We demonstrate this on various classification tasks using a cortical microcircuit model with prospective coding.
Submitted 2 February, 2024; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Technological taxonomies for hypernym and hyponym retrieval in patent texts
Authors:
You Zuo,
Yixuan Li,
Alma Parias García,
Kim Gerdes
Abstract:
This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological terms for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
Submitted 13 December, 2022; v1 submitted 14 November, 2022;
originally announced December 2022.
-
EURO: ESPnet Unsupervised ASR Open-source Toolkit
Authors:
Dongji Gao,
Jiatong Shi,
Shun-Po Chuang,
Leibny Paola Garcia,
Hung-yi Lee,
Shinji Watanabe,
Sanjeev Khudanpur
Abstract:
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by Wav2vec-U, originally implemented in FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extends the functionality and promotes reproducibility for UASR tasks by integrating S3PRL and k2, resulting in flexible frontends from 27 self-supervised models and various graph-based decoding strategies. EURO is implemented in ESPnet and follows its unified pipeline to provide UASR recipes with a complete setup. This improves the pipeline's efficiency and allows EURO to be easily applied to existing datasets in ESPnet. Extensive experiments on three mainstream self-supervised models demonstrate the toolkit's effectiveness and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech datasets. EURO will be publicly available at https://github.com/espnet/espnet, aiming to promote this exciting and emerging research area based on UASR through open-source activity.
Submitted 20 May, 2023; v1 submitted 30 November, 2022;
originally announced November 2022.
-
High-Quality Fault Resiliency in Fat Trees
Authors:
John Gliksberg,
Antoine Capra,
Alexandre Louvet,
Pedro Javier Garcia,
Devan Sohier
Abstract:
Coupling regular topologies with optimised routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) which minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralised fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters.
Submitted 23 November, 2022;
originally announced November 2022.
-
Node-Type-Based Load-Balancing Routing for Parallel Generalized Fat-Trees
Authors:
John Gliksberg,
Jean-Noel Quintin,
Pedro Javier Garcia
Abstract:
High-Performance Computing (HPC) clusters are made up of a variety of node types (usually compute, I/O, service, and GPGPU nodes), and applications don't use nodes of different types in the same way. The resulting communication patterns reflect the organization of groups of nodes, and current optimal routing algorithms for all-to-all patterns will not always maximize performance for group-specific communications. Since application communication patterns are rarely available beforehand, we choose to rely on node types as a good guess for node usage. We provide a description of node type heterogeneity and analyse the performance degradation caused by an unlucky distribution of nodes of the same type. We provide an extension to routing algorithms for Parallel Generalized Fat-Tree topologies (PGFTs) which balances load amongst groups of nodes of the same type. We show how it removes these performance issues by comparing results in a variety of situations against corresponding classical algorithms.
Submitted 21 November, 2022;
originally announced November 2022.
-
High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)
Authors:
John Gliksberg,
Antoine Capra,
Alexandre Louvet,
Pedro Javier Garcia,
Devan Sohier
Abstract:
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete rerouting of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM) first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Under heavy degradation, Dmodc shows A2A and RP congestion risks similar to those of the most stable algorithms compared, and near-optimal SP congestion risk up to 1% of random degradation.
Submitted 21 November, 2022;
originally announced November 2022.
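
As background for the "modulo-based computation of forwarding tables" mentioned above, the following sketch shows the classic destination-modulo rule for choosing an up-port in a fault-free fat tree. This is not Dmodc: it ignores degraded topologies, subtree knowledge, and parallel links. The function name and the `up_ports` list (the number of upward ports per switch level, leaf level first) are hypothetical, introduced only for this illustration.

```python
from math import prod


def dmodk_up_port(dest, level, up_ports):
    """Up-port index used at a switch on `level` (0 = leaf switches)
    when forwarding towards destination node index `dest`.

    The destination is divided by the product of the up-port counts of
    all lower levels, then reduced modulo the up-port count at this
    level, which spreads destinations evenly over the upward links.
    """
    divisor = prod(up_ports[:level])  # prod of an empty list is 1
    return (dest // divisor) % up_ports[level]


if __name__ == "__main__":
    up_ports = [4, 4]  # assumed two switch levels with 4 up-ports each
    for level in range(len(up_ports)):
        print(level, dmodk_up_port(13, level, up_ports))
```

Because the result depends only on the destination index and fixed per-level constants, each table entry is a closed-form arithmetic expression, which is the property that makes full recomputation cheap.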
-
Bridging Speech and Textual Pre-trained Models with Unsupervised ASR
Authors:
Jiatong Shi,
Chan-Jan Hsu,
Holam Chung,
Dongji Gao,
Paola Garcia,
Shinji Watanabe,
Ann Lee,
Hung-yi Lee
Abstract:
Spoken language understanding (SLU) is a task aiming to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements to various SLU tasks. However, because of the mismatched modalities between speech signals and text tokens, previous methods usually need complex framework designs. This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models, resulting in an unsupervised speech-to-semantic pre-trained model for various tasks in SLU. To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models. Our experiments show that unsupervised ASR itself can improve the representations from speech self-supervised models. More importantly, it proves to be an efficient connector between speech and textual pre-trained models, improving performance on five different SLU tasks. Notably, on spoken question answering, we reach the state-of-the-art result on the challenging NMSQA benchmark.
Submitted 6 November, 2022;
originally announced November 2022.
-
Adapting self-supervised models to multi-talker speech recognition using speaker embeddings
Authors:
Zili Huang,
Desh Raj,
Paola García,
Sanjeev Khudanpur
Abstract:
Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have degraded performance for multi-talker scenarios -- possibly due to the domain mismatch -- which severely limits their use for such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models for real conversational mixtures through experiments on the AMI dataset.
Submitted 1 November, 2022;
originally announced November 2022.
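
The relative WER improvements quoted above follow the usual definition, (baseline WER - new WER) / baseline WER. The sketch below computes word error rate with a standard word-level edit distance and then the relative improvement. The function names and the toy sentences are illustrative assumptions and are unrelated to the paper's data or results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.
    Assumes a non-empty reference string."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(round(wer, 3))
    print(relative_improvement(0.20, 0.10))
```

A relative figure can look large even when the absolute change is small, so it is usually read alongside the absolute WER values.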
-
Reducing Language confusion for Code-switching Speech Recognition with Token-level Language Diarization
Authors:
Hexin Liu,
Haihua Xu,
Leibny Paola Garcia,
Andy W. H. Khong,
Yi He,
Sanjeev Khudanpur
Abstract:
Code-switching (CS) refers to the phenomenon in which languages switch within a speech signal, leading to language confusion for automatic speech recognition (ASR). This paper aims to address language confusion for improving CS-ASR from two perspectives: incorporating and disentangling language information. We incorporate language information in the CS-ASR model by dynamically biasing the model with token-level language posteriors, which are outputs of a sequence-to-sequence auxiliary language diarization (LD) module. In contrast, the disentangling process reduces the difference between languages via adversarial training so as to normalize the two languages. We conduct experiments on the SEAME dataset. Compared to the baseline model, both the joint optimization with LD and the language posterior bias achieve performance improvement. The comparison of the proposed methods indicates that incorporating language information is more effective than disentangling for reducing language confusion in CS speech.
Submitted 26 October, 2022;
originally announced October 2022.
-
On Compressing Sequences for Self-Supervised Speech Models
Authors:
Yen Meng,
Hsuan-Jui Chen,
Jiatong Shi,
Shinji Watanabe,
Paola Garcia,
Hung-yi Lee,
Hao Tang
Abstract:
Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
Submitted 25 October, 2022; v1 submitted 13 October, 2022;
originally announced October 2022.
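
As a rough illustration of fixed-length subsampling along the time axis, the sketch below averages every `factor` consecutive feature frames. Average pooling is an assumption chosen for simplicity and may differ from the paper's method; the function name is hypothetical. Under this scheme a 50 Hz frame sequence pooled with a factor of 5 would come out at 10 Hz.

```python
def subsample_fixed(frames, factor):
    """Average-pool a sequence of feature frames over fixed-size windows.

    `frames` is a list of equal-length feature vectors (lists of floats).
    A shorter final window is averaged over the frames it contains.
    """
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window)
                       for d in range(dim)])
    return pooled


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(subsample_fixed(feats, 2))
```

Variable-length subsampling would instead choose window boundaries from the signal itself, for example at phonetic boundaries, rather than at a fixed stride.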