-
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
Authors:
Jiamin Xie,
Ju Lin,
Yiteng Huang,
Tyler Vuong,
Zhaojiang Lin,
Zhaojun Yang,
Peng Su,
Prashant Rawat,
Sangeeta Srivastava,
Ming Sun,
Florian Metze
Abstract:
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks.
Submitted 17 June, 2025;
originally announced June 2025.
-
Uncertainty-Driven Radar-Inertial Fusion for Instantaneous 3D Ego-Velocity Estimation
Authors:
Prashant Kumar Rai,
Elham Kowsari,
Nataliya Strokina,
Reza Ghabcheloo
Abstract:
We present a method for estimating ego-velocity in autonomous navigation by integrating high-resolution imaging radar with an inertial measurement unit. The proposed approach addresses the limitations of traditional radar-based ego-motion estimation techniques by employing a neural network to process complex-valued raw radar data and estimate instantaneous linear ego-velocity along with its associated uncertainty. This uncertainty-aware velocity estimate is then integrated with inertial measurement unit data using an Extended Kalman Filter. The filter leverages the network-predicted uncertainty to refine the inertial sensor's noise and bias parameters, improving the overall robustness and accuracy of the ego-motion estimation. We evaluated the proposed method on the publicly available ColoRadar dataset. Our approach achieves significantly lower error compared to the closest publicly available method and also outperforms both instantaneous and scan matching-based techniques.
Submitted 17 June, 2025;
originally announced June 2025.
-
FeatureSense: Protecting Speaker Attributes in Always-On Audio Sensing System
Authors:
Bhawana Chhaglani,
Sarmistha Sarna Gomasta,
Yuvraj Agarwal,
Jeremy Gummeson,
Prashant Shenoy
Abstract:
Audio is a rich sensing modality that is useful for a variety of human activity recognition tasks. However, the ubiquitous nature of smartphones and smart speakers with always-on microphones has led to numerous privacy concerns and a lack of trust in deploying these audio-based sensing systems. This paper addresses the critical challenge of preserving user privacy when using audio for sensing applications while maintaining utility. While prior work focuses primarily on protecting recoverable speech content, we show that sensitive speaker-specific attributes such as age and gender can still be inferred after masking speech, and we propose a comprehensive privacy evaluation framework to assess this speaker attribute leakage. We design and implement FeatureSense, an open-source library that provides a set of generalizable privacy-aware audio features that can be used for a wide range of sensing applications. We present an adaptive task-specific feature selection algorithm that optimizes the privacy-utility-cost trade-off based on the application requirements. Through our extensive evaluation, we demonstrate the high utility of FeatureSense across a diverse set of sensing tasks. Our system outperforms existing privacy techniques by 60.6% in preserving user-specific privacy. This work provides a foundational framework for ensuring trust in audio sensing by enabling effective privacy-aware audio classification systems.
Submitted 29 May, 2025;
originally announced May 2025.
-
Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures
Authors:
Heng-Sheng Chang,
Prashant G. Mehta
Abstract:
This paper presents a mathematical framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture, in which a finite sequence of observations (tokens) is mapped to the conditional probability of the next token. Our objective is not to construct a mathematical model of a transformer. Rather, our interest lies in deriving, from first principles, transformer-like architectures that solve the prediction problem for which the transformer is designed. The proposed framework is based on an original optimal control approach, where the prediction objective (MMSE) is reformulated as an optimal control problem. An analysis of the optimal control problem is presented, leading to a fixed-point equation on the space of probability measures. To solve the fixed-point equation, we introduce the dual filter, an iterative algorithm that closely parallels the architecture of decoder-only transformers. These parallels are discussed in detail along with the relationship to prior work on mathematical modeling of transformers as transport on the space of probability measures. Numerical experiments are provided to illustrate the performance of the algorithm using parameter values used in research-scale transformer models.
Submitted 1 May, 2025;
originally announced May 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
Submitted 14 April, 2025;
originally announced April 2025.
-
Error Analysis of Sampling Algorithms for Approximating Stochastic Optimal Control
Authors:
Anant A. Joshi,
Amirhossein Taghvaei,
Prashant G. Mehta
Abstract:
This paper is concerned with the error analysis of two types of sampling algorithms, namely model predictive path integral (MPPI) and an interacting particle system (IPS) algorithm, that have been proposed in the literature for numerical approximation of the stochastic optimal control. The analysis is presented through the lens of the Gibbs variational principle. For an illustrative example of a single-stage stochastic optimal control problem, analytical expressions for approximation error and scaling laws, with respect to the state dimension and sample size, are derived. The analytical results are illustrated with numerical simulations.
Submitted 2 April, 2025;
originally announced April 2025.
-
Certified Approximate Reachability (CARe): Formal Error Bounds on Deep Learning of Reachable Sets
Authors:
Prashant Solanki,
Nikolaus Vertovec,
Yannik Schnitzer,
Jasper Van Beers,
Coen de Visser,
Alessandro Abate
Abstract:
Recent approaches to leveraging deep learning for computing reachable sets of continuous-time dynamical systems have gained popularity over traditional level-set methods, as they overcome the curse of dimensionality. However, as with level-set methods, considerable care needs to be taken in limiting approximation errors, particularly since no guarantees are provided during training on the accuracy of the learned reachable set. To address this limitation, we introduce an epsilon-approximate Hamilton-Jacobi Partial Differential Equation (HJ-PDE), which establishes a relationship between training loss and accuracy of the true reachable set. To formally certify this approximation, we leverage Satisfiability Modulo Theories (SMT) solvers to bound the residual error of the HJ-based loss function across the domain of interest. Leveraging Counter Example Guided Inductive Synthesis (CEGIS), we close the loop around learning and verification, by fine-tuning the neural network on counterexamples found by the SMT solver, thus improving the accuracy of the learned reachable set. To the best of our knowledge, Certified Approximate Reachability (CARe) is the first approach to provide soundness guarantees on learned reachable sets of continuous dynamical systems.
Submitted 31 March, 2025;
originally announced March 2025.
-
Fibonacci-Net: A Lightweight CNN model for Automatic Brain Tumor Classification
Authors:
Santanu Roy,
Ashvath Suresh,
Archit Gupta,
Shubhi Tiwari,
Palak Sahu,
Prashant Adhikari,
Yuvraj S. Shekhawat
Abstract:
This research proposes a very lightweight model, "Fibonacci-Net", along with a novel pooling technique, for automatic brain tumor classification from imbalanced Magnetic Resonance Imaging (MRI) datasets. Automatic brain tumor detection from MRI datasets has garnered significant attention in the research community since the inception of Convolutional Neural Network (CNN) models. However, the performance of conventional CNN models is hindered by class imbalance problems. The novelties of this work are as follows: (I) A lightweight CNN model is proposed in which the number of filters in different convolutional layers is chosen according to the numbers of the Fibonacci series. (II) In the last two blocks of the proposed model, depth-wise separable convolution (DWSC) layers are employed to considerably reduce the computational complexity of the model. (III) Two parallel concatenations (or skip connections) are deployed from the 2nd to the 4th, and from the 3rd to the 5th convolutional block in the proposed Fibonacci-Net. This skip connection encompasses a novel Average-2Max pooling layer that produces two stacks of convolved output with slightly different statistics. Therefore, this parallel concatenation block works as an efficient feature augmenter inside the model, thus automatically alleviating the class imbalance problem to a certain extent. For validation, we have implemented the proposed framework on three MRI datasets which are highly class-imbalanced. (a) The first dataset has four classes, i.e., glioma tumor, meningioma tumor, pituitary tumor, and no-tumor. (b) The second and third MRI datasets have 15 and 44 classes respectively. Experimental results reveal that, after employing the proposed Fibonacci-Net, we have achieved 96.2% accuracy, 97.17% precision, 95.9% recall, 96.5% F1 score, and 99.9% specificity on the most challenging "44-class MRI dataset".
Submitted 18 March, 2025;
originally announced March 2025.
-
Surgical Vision World Model
Authors:
Saurabh Koju,
Saurav Bastola,
Prashant Shrestha,
Sanskar Amgain,
Yash Raj Shrestha,
Rudra P. K. Poudel,
Binod Bhattarai
Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations and lack realism. Furthermore, existing literature on world models has predominantly dealt with action-labeled data, limiting its applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data, and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Code and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Submitted 3 March, 2025;
originally announced March 2025.
-
LifeSaver: Predictive Load Limit Estimation for Transport Vehicles in Hilly Areas
Authors:
Chanakya Rao,
Vaibhav Chopra,
Moksh Soni,
Prashant Mishra
Abstract:
The transportation of essential goods in mountainous regions faces severe logistical challenges and frequent disruptions. To mitigate these difficulties, transport companies often overload trucks, which, though cost-saving, significantly heightens the risk of accidents and mechanical failures. This paper presents the development of a device that detects overloaded and insecurely fastened loads on trucks and commercial vehicles. Using advanced load sensors, the device offers real-time monitoring of cargo weight distribution, alerting drivers and authorities to unsafe conditions. The initial prototype utilised two basic load cells and an Arduino microcontroller. The second version was enhanced with four load cells and extended sensors. This version was tested by placing an electric golf cart onto the prototype. Various loads were then added to the cart in different orientations to assess whether the system could accurately detect improper or excessive load conditions.
Submitted 14 February, 2025;
originally announced February 2025.
-
Zero-resource Speech Translation and Recognition with LLMs
Authors:
Karel Mundnich,
Xing Niu,
Prashant Mathur,
Srikanth Ronanki,
Brady Houston,
Veera Raghavendra Elluru,
Nilaksh Das,
Zejiang Hou,
Goeric Huybrechts,
Anshu Bhatia,
Daniel Garcia-Romero,
Kyu J. Han,
Katrin Kirchhoff
Abstract:
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
Submitted 30 December, 2024; v1 submitted 24 December, 2024;
originally announced December 2024.
-
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Authors:
Lucas Goncalves,
Prashant Mathur,
Xing Niu,
Brady Houston,
Chandrashekhar Lavania,
Srikanth Vishnubhotla,
Lijia Sun,
Anthony Ferritto
Abstract:
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony, that is, ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
Submitted 21 December, 2024;
originally announced December 2024.
-
MulModSeg: Enhancing Unpaired Multi-Modal Medical Image Segmentation with Modality-Conditioned Text Embedding and Alternating Training
Authors:
Chengyin Li,
Hui Zhu,
Rafi Ibn Sultan,
Hassan Bagher Ebadian,
Prashant Khanduri,
Chetty Indrin,
Kundan Thind,
Dongxiao Zhu
Abstract:
In the diverse field of medical imaging, automatic segmentation has numerous applications and must handle a wide variety of input domains, such as different types of Computed Tomography (CT) scans and Magnetic Resonance (MR) images. This heterogeneity challenges automatic segmentation algorithms to maintain consistent performance across different modalities due to the requirement for spatially aligned and paired images. Typically, segmentation models are trained using a single modality, which limits their ability to generalize to other types of input data without employing transfer learning techniques. Additionally, leveraging complementary information from different modalities to enhance segmentation precision often necessitates substantial modifications to popular encoder-decoder designs, such as introducing multiple branched encoding or decoding paths for each modality. In this work, we propose a simple Multi-Modal Segmentation (MulModSeg) strategy to enhance medical image segmentation across multiple modalities, specifically CT and MR. It incorporates two key designs: a modality-conditioned text embedding framework via a frozen text encoder that adds modality awareness to existing segmentation frameworks without significant structural modifications or computational overhead, and an alternating training procedure that facilitates the integration of essential features from unpaired CT and MR inputs. Through extensive experiments with both Fully Convolutional Network and Transformer-based backbones, MulModSeg consistently outperforms previous methods in segmenting abdominal multi-organ and cardiac substructures for both CT and MR modalities. The code is available at https://github.com/ChengyinLee/MulModSeg_2024.
Submitted 23 November, 2024;
originally announced November 2024.
-
How to implement the Bayes' formula in the age of ML?
Authors:
Amirhossein Taghvaei,
Prashant G. Mehta
Abstract:
This chapter contains a self-contained introduction to the significance of Bayes' formula in the context of nonlinear filtering problems. Both discrete-time and continuous-time settings of the problem are considered in a unified manner. In control theory, the focus on optimization-based solution approaches is stressed together with a discussion of historical developments in this area (from the 1960s onwards). The heart of this chapter contains a presentation of a novel optimal transportation formulation for Bayes' formula (developed recently by the first author) and its relationship to some of the prior joint work (feedback particle filter) from the authors. The presentation highlights how optimal transportation theory is leveraged to overcome some of the numerical challenges of implementing Bayes' law by enabling the use of machine learning (ML) tools.
Submitted 14 November, 2024;
originally announced November 2024.
-
Memristors based Computation and Synthesis
Authors:
Prashant Gupta,
Priscilla Jennifer
Abstract:
The memristor was identified as the fourth fundamental circuit element by Dr. Leon Chua in 1971. Since then it has attracted considerable interest because of its non-volatility, and it is considered a viable solution for computation in the beyond-CMOS era. Recently, memristors have been used to perform basic logic operations such as AND, OR, NAND, NOR, and XOR, and in applications such as the Dot Product Engine and Convolutional Neural Networks. This paper presents a new behavioural model of the memristor and then uses it to build a 32-bit ripple carry adder. The paper later compares the area, power, and time delay of the memristor-based 32-bit ripple carry adder with those of 45nm CMOS technology and highlights its advantages and pitfalls.
Submitted 4 September, 2024;
originally announced September 2024.
-
LEAD: Towards Learning-Based Equity-Aware Decarbonization in Ridesharing Platforms
Authors:
Mahsa Sahebdel,
Ali Zeynali,
Noman Bashir,
Prashant Shenoy,
Mohammad Hajiesmaili
Abstract:
Ridesharing platforms such as Uber, Lyft, and DiDi have grown in popularity due to their on-demand availability, ease of use, and commute cost reductions, among other benefits. However, not all ridesharing promises have panned out. Recent studies demonstrate that the expected drop in traffic congestion and reduction in greenhouse gas (GHG) emissions have not materialized. This is primarily due to the substantial distances traveled by the ridesharing vehicles without passengers between rides, known as deadhead miles. Recent work has focused on reducing the impact of deadhead miles while considering additional metrics such as rider waiting time, GHG emissions from deadhead miles, or driver earnings. However, most prior studies consider these environmental and equity-based metrics individually despite them being interrelated. In this paper, we propose a Learning-based Equity-Aware Decarbonization approach, LEAD, for ridesharing platforms. LEAD targets minimizing emissions while ensuring that the driver's utility, defined as the difference between the trip distance and the deadhead miles, is fairly distributed. LEAD uses reinforcement learning to match riders with drivers based on the expected future utility of drivers and the expected carbon emissions of the platform without increasing the rider waiting times. Extensive experiments based on a real-world ridesharing dataset show that LEAD improves the defined notion of fairness by 150% when compared to emission-aware ride-assignment and reduces emissions by 14.6% while ensuring fairness within 28-52% of the fairness-focused baseline. It also reduces the rider wait time by at least 32.1% compared to a fairness-focused baseline.
Submitted 12 April, 2025; v1 submitted 19 August, 2024;
originally announced August 2024.
-
A novel metric for detecting quadrotor loss-of-control
Authors:
Jasper van Beers,
Prashant Solanki,
Coen de Visser
Abstract:
Unmanned aerial vehicles (UAVs) are becoming an integral part of both industry and society. In particular, the quadrotor is now invaluable across a plethora of fields, and recent developments, such as the inclusion of aerial manipulators, only extend its versatility. As UAVs become more widespread, preventing loss-of-control (LOC) is an ever-growing concern. Unfortunately, LOC is not clearly defined for quadrotors, or indeed, many other autonomous systems. Moreover, any existing definitions are often incomplete and restrictive. A novel metric, based on actuator capabilities, is introduced to detect LOC in quadrotors. The potential of this metric for LOC detection is demonstrated through both simulated and real quadrotor flight data. It is able to detect LOC induced by actuator faults without explicit knowledge of the occurrence and nature of the failure. The proposed metric is also sensitive enough to detect LOC in more nuanced cases, where the quadrotor remains undamaged but nevertheless loses control through an aggressive yawing manoeuvre. As the metric depends only on system and actuator models, it is sufficiently general to be applied to other systems.
Submitted 12 August, 2024;
originally announced August 2024.
-
Design of Interacting Particle Systems for Fast Linear Quadratic RL
Authors:
Anant A Joshi,
Heng-Sheng Chang,
Amirhossein Taghvaei,
Prashant G Mehta,
Sean P. Meyn
Abstract:
This paper is concerned with the design of algorithms based on systems of interacting particles to represent, approximate, and learn the optimal control law for reinforcement learning (RL). The primary contribution is that convergence rates are greatly accelerated by the interactions between particles. Theory focuses on the linear quadratic stochastic optimal control problem for which a complete and novel theory is presented. Apart from the new algorithm, sample complexity bounds are obtained, and it is shown that the mean square error scales as $1/N$ where $N$ is the number of particles. The theoretical results and algorithms are illustrated with numerical experiments and comparisons with other recent approaches, where the faster convergence of the proposed algorithm is numerically demonstrated.
Submitted 1 December, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
Authors:
Suwon Shon,
Kwangyoun Kim,
Yi-Te Hsu,
Prashant Sridhar,
Shinji Watanabe,
Karen Livescu
Abstract:
The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel frequency Cepstral Coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
Submitted 13 June, 2024;
originally announced June 2024.
-
SpeechVerse: A Large-scale Generalizable Audio Language Model
Authors:
Nilaksh Das,
Saket Dingliwal,
Srikanth Ronanki,
Rohit Paturi,
Zhaocheng Huang,
Prashant Mathur,
Jie Yuan,
Dhanush Bekal,
Xing Niu,
Sai Muralidhar Jayanthi,
Xilai Li,
Karel Mundnich,
Monica Sunkara,
Sravan Bodapati,
Sundararajan Srinivasan,
Kyu J Han,
Katrin Kirchhoff
Abstract:
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
Submitted 24 March, 2025; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Towards Privacy-Preserving Audio Classification Systems
Authors:
Bhawana Chhaglani,
Jeremy Gummeson,
Prashant Shenoy
Abstract:
Audio signals can reveal intimate details about a person's life, including their conversations, health status, emotions, location, and personal preferences. Unauthorized access or misuse of this information can have profound personal and social implications. In an era increasingly populated by devices capable of audio recording, safeguarding user privacy is a critical obligation. This work studies the ethical and privacy concerns in current audio classification systems. We discuss the challenges and research directions in designing privacy-preserving audio sensing systems. We propose privacy-preserving audio features that can be used to classify a wide range of audio classes.
Submitted 7 June, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores
Authors:
Lucas Goncalves,
Prashant Mathur,
Chandrashekhar Lavania,
Metehan Cekic,
Marcello Federico,
Kyu J. Han
Abstract:
Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large-scale human-annotated dataset (100+ hrs) representing nine types of synchronization errors in audio-visual content and how humans perceive them. We then developed a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization. We validate PEAVS using a newly generated dataset, achieving a Pearson correlation of 0.79 at the set level and 0.54 at the clip level when compared to human labels. In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony, confirming the efficacy of PEAVS in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".
Submitted 10 April, 2024;
originally announced April 2024.
-
Dual Ensemble Kalman Filter for Stochastic Optimal Control
Authors:
Anant A. Joshi,
Amirhossein Taghvaei,
Prashant G. Mehta,
Sean P. Meyn
Abstract:
In this paper, stochastic optimal control problems in continuous time and space are considered. In recent years, such problems have received renewed attention through the lens of reinforcement learning (RL), which is also one of our motivations. The main contribution is a simulation-based algorithm -- dual ensemble Kalman filter (EnKF) -- to numerically approximate the solution of these problems. The paper extends our previous work where the dual EnKF was applied in deterministic settings of the problem. The theoretical results and algorithms are illustrated with numerical experiments.
Submitted 26 October, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
Investigating the Robustness of Vision Transformers against Label Noise in Medical Image Classification
Authors:
Bidur Khanal,
Prashant Shrestha,
Sanskar Amgain,
Bishesh Khanal,
Binod Bhattarai,
Cristian A. Linte
Abstract:
Label noise in medical image classification datasets significantly hampers the training of supervised deep learning methods, undermining their generalizability. The test performance of a model tends to decrease as the label noise rate increases. Over recent years, several methods have been proposed to mitigate the impact of label noise in medical image classification and enhance the robustness of the model. Predominantly, these works have employed CNN-based architectures as the backbone of their classifiers for feature extraction. However, in recent years, Vision Transformer (ViT)-based backbones have replaced CNNs, demonstrating improved performance and a greater ability to learn more generalizable features, especially when the dataset is large. Nevertheless, no prior work has rigorously investigated how transformer-based backbones handle the impact of label noise in medical image classification. In this paper, we investigate the architectural robustness of ViT against label noise and compare it to that of CNNs. We use two medical image classification datasets -- COVID-DU-Ex, and NCT-CRC-HE-100K -- both corrupted by injecting label noise at various rates. Additionally, we show that pretraining is crucial for ensuring ViT's improved robustness against label noise in supervised training.
Submitted 26 February, 2024;
originally announced February 2024.
-
A Holistic Approach for Equity-aware Carbon Reduction of Ridesharing Platforms
Authors:
Mahsa Sahebdel,
Ali Zeynali,
Noman Bashir,
Prashant Shenoy,
Mohammad Hajiesmaili
Abstract:
Ridesharing services have revolutionized personal mobility, offering convenient on-demand transportation anytime. While early proponents of ridesharing suggested that these services would reduce the overall carbon emissions of the transportation sector, recent studies reported a type of rebound effect showing substantial carbon emissions of ridesharing platforms, mainly due to their deadhead miles traveled between two consecutive rides. However, reducing deadhead miles' emissions can incur longer waiting times for riders and starvation of ride assignments for some drivers. Therefore, any efforts towards reducing the carbon emissions from ridesharing platforms must consider the impact on the quality of service, e.g., waiting time, and on the equitable distribution of rides across drivers. This paper proposes a holistic approach to reduce the carbon emissions of ridesharing platforms while minimizing the degradation in user waiting times and equitable ride assignments across drivers. Towards this end, we decompose the global carbon reduction problem into two sub-problems: carbon- and equity-aware ride assignment and fuel-efficient routing. For the ride assignment problem, we consider the trade-off between the amount of carbon reduction and the rider's waiting time and propose simple yet efficient algorithms to handle the conflicting trade-offs. For the routing problem, we analyze the impact of fuel-efficient routing in reducing the carbon footprint, trip duration, and driver efficiency of ridesharing platforms using route data from Google Maps. Our comprehensive trace-driven experimental results show significant emissions reduction with a minor increase in riders' waiting times. Finally, we release E$^2$-RideKit, a toolkit that enables researchers to augment ridesharing datasets with emissions and equity information for further research on emission analysis and platform improvement.
Submitted 16 February, 2024; v1 submitted 2 January, 2024;
originally announced February 2024.
-
Neural Models and Algorithms for Sensorimotor Control of an Octopus Arm
Authors:
Tixian Wang,
Udit Halder,
Ekaterina Gribkova,
Rhanor Gillette,
Mattia Gazzola,
Prashant G. Mehta
Abstract:
In this article, a biophysically realistic model of a soft octopus arm with internal musculature is presented. The modeling is motivated by experimental observations of sensorimotor control where an arm localizes and reaches a target. Major contributions of this article are: (i) development of models to capture the mechanical properties of arm musculature, the electrical properties of the arm peripheral nervous system (PNS), and the coupling of PNS with muscular contractions; (ii) modeling the arm sensory system, including chemosensing and proprioception; and (iii) algorithms for sensorimotor control, which include a novel feedback neural motor control law for mimicking target-oriented arm reaching motions, and a novel consensus algorithm for solving sensing problems such as locating a food source from local chemical sensory information (exogenous) and arm deformation information (endogenous). Several analytical results, including rest-state characterization and stability properties of the proposed sensing and motor control algorithms, are provided. Numerical simulations demonstrate the efficacy of our approach. Qualitative comparisons against observed arm rest shapes and target-oriented reaching motions are also reported.
Submitted 27 April, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Uncertainty-Aware Guidance for Target Tracking subject to Intermittent Measurements using Motion Model Learning
Authors:
Andres Pulido,
Kyle Volle,
Kristy Waters,
Zachary I. Bell,
Prashant Ganesh,
Jane Shin
Abstract:
This paper presents a novel guidance law for target tracking applications where the target motion model is unknown and sensor measurements are intermittent due to unknown environmental conditions and low measurement update rate. In this work, the target motion model is represented by a transformer neural network and trained by previous target position measurements. This transformer motion model serves as the prediction step in a particle filter for target state estimation and uncertainty quantification. The particle filter estimation uncertainty is utilized in the information-driven guidance law to compute a path for the mobile agent to travel to a position with maximum expected entropy reduction (EER). The computation of EER is performed in real-time by approximating the information gain from the predicted particle distributions relative to the current distribution. Simulation and hardware experiments are performed with a quadcopter agent and TurtleBot target to demonstrate that the presented guidance law outperforms two other baseline guidance methods.
Submitted 20 March, 2025; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis
Authors:
Prabhav Agrawal,
Thilo Koehler,
Zhiping Xiu,
Prashant Serai,
Qing He
Abstract:
Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device such as smart glasses. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and is therefore an order of magnitude faster than any neural vocoder. A DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.
Submitted 18 January, 2024;
originally announced January 2024.
-
Improving ASR Contextual Biasing with Guided Attention
Authors:
Jiyang Tang,
Kwangyoun Kim,
Suwon Shon,
Felix Wu,
Prashant Sridhar,
Shinji Watanabe
Abstract:
In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we employ a GA loss as an additional training objective besides the Transducer loss. The proposed GA loss aims to teach the cross attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement. Through extensive experiments based on Conformer Transducer with Contextual Adapter, we demonstrate that the proposed method not only leads to a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER of rare vocabularies by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and up to 49.3% compared to a vanilla Transducer.
Submitted 16 January, 2024;
originally announced January 2024.
-
Generative Context-aware Fine-tuning of Self-supervised Speech Models
Authors:
Suwon Shon,
Kwangyoun Kim,
Prashant Sridhar,
Yi-Te Hsu,
Shinji Watanabe,
Karen Livescu
Abstract:
When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, an LLM could generate a prediction of the next sentence or abstractive text like titles or topics. In this paper, we study the use of LLM-generated context information and propose an approach to distill the generated information during fine-tuning of self-supervised speech models, which we refer to as generative context-aware fine-tuning. This approach allows the fine-tuned model to make improved predictions without access to the true surrounding segments or to the LLM at inference time, while requiring only a very small additional context module. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis. The results show that generative context-aware fine-tuning outperforms a context injection fine-tuning approach that accesses the ground-truth previous text, and is competitive with a generative context injection fine-tuning approach that requires the LLM at inference time.
Submitted 15 December, 2023;
originally announced December 2023.
-
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation
Authors:
Juan Zuluaga-Gomez,
Zhaocheng Huang,
Xing Niu,
Rohit Paturi,
Sundararajan Srinivasan,
Prashant Mathur,
Brian Thompson,
Marcello Federico
Abstract:
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems, show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.
Submitted 1 November, 2023;
originally announced November 2023.
-
Deep Nonlinear Adaptive Control for Unmanned Aerial Systems Operating under Dynamic Uncertainties
Authors:
Zachary Lamb,
Zachary I. Bell,
Matthew Longmire,
Jared Paquet,
Prashant Ganesh,
Ricardo Sanfelice
Abstract:
Recent literature in the field of machine learning (ML) control has shown promising theoretical results for a Deep Neural Network (DNN) based Nonlinear Adaptive Controller (DNAC) capable of achieving trajectory tracking for nonlinear systems. Expanding on this work, this paper applies DNAC to the Attitude Control System (ACS) of a quadrotor and shows improvement to attitude control performance under disturbed flying conditions where the model uncertainty is high. Moreover, these results are noteworthy for ML control because they were achieved with no prior training data and an arbitrary system dynamics initialization; simply put, the controller presented in this paper is practically modelless, yet yields the ability to force trajectory tracking for nonlinear systems while rejecting significant undesirable model disturbances learned through a DNN. The combination of ML techniques to learn a system's dynamics and the Lyapunov analysis required to provide stability guarantees leads to a controller with applications in safety-critical systems that may undergo uncertain model changes, as is the case for most aerial systems. Experimental findings are analyzed in the final section of this paper, and DNAC is shown to outperform the trajectory tracking capabilities of PID, MRAC, and the recently developed Deep Model Reference Adaptive Control (DMRAC) schemes.
Submitted 14 October, 2023;
originally announced October 2023.
-
Carbon Containers: A System-level Facility for Managing Application-level Carbon Emissions
Authors:
John Thiede,
Noman Bashir,
David Irwin,
Prashant Shenoy
Abstract:
To reduce their environmental impact, cloud datacenters are increasingly focused on optimizing applications' carbon-efficiency, or work done per mass of carbon emitted. To facilitate such optimizations, we present Carbon Containers, a simple system-level facility, which extends prior work on power containers, that automatically regulates applications' carbon emissions in response to variations in both their workload's intensity and their energy's carbon-intensity. Specifically, Carbon Containers enable applications to specify a maximum carbon emissions rate (in g$\cdot$CO$_2$e/hr), and then transparently enforce this rate via a combination of vertical scaling, container migration, and suspend/resume while maximizing either energy-efficiency or performance.
Carbon Containers are especially useful for applications that i) must continue running even during high-carbon periods, and ii) execute in regions with few variations in carbon-intensity. These low-variability regions also tend to have high average carbon-intensity, which increases the importance of regulating carbon emissions. We implement a Carbon Containers prototype by extending Linux Containers to incorporate the mechanisms above and evaluate it using real workload traces and carbon-intensity data from multiple regions. We compare Carbon Containers with prior work that regulates carbon emissions by suspending/resuming applications during high/low carbon periods. We show that Carbon Containers are more carbon-efficient and improve performance while maintaining similar carbon emissions.
Submitted 25 September, 2023;
originally announced September 2023.
-
Fault Detection and Classification using Wavelet and ANN in DFIG and TCSC Connected Transmission Line
Authors:
Satya Vikram Pratap Singh,
Tanu Prasad,
Siddharth Kamila,
Prashant Agnihotri
Abstract:
This paper presents fault detection and classification using Wavelet and ANN based methods in a DFIG-based series compensated system. The state-of-the-art methods for fault detection and classification include Wavelet transform, Fourier transform, and Wavelet-neuro fuzzy based methods. However, the accuracy of these state-of-the-art methods diminishes during variable conditions such as changes in wind speed, high impedance faults, and changes in the series compensation level. Specifically, in Wavelet transform based methods, the threshold values need to be adapted based on the variable field conditions. To solve this problem, this paper proposes a Wavelet-ANN based fault detection method where Wavelet is used as an identifier and ANN is used as a classifier for detecting various fault cases. This methodology is also effective under SSR condition. The proposed methodology is evaluated on various fault and non-fault cases generated on an IEEE first benchmark model under varying compensation levels from 20% to 55%, impedance faults, and wind velocity from 6 m/s to 10 m/s, using MATLAB/Simulink, an OPAL-RT (OP4510) real-time digital simulator environment, and Arduino board I/O ports communicating with an external PC on which the ANN model is loaded, using the Arduino support package of MATLAB. The preliminary results are compared with the state-of-the-art fault detection method, where the proposed method shows robust performance under varying field conditions.
Submitted 17 August, 2023;
originally announced August 2023.
-
On the Limitations of Carbon-Aware Temporal and Spatial Workload Shifting in the Cloud
Authors:
Thanathorn Sukprasert,
Abel Souza,
Noman Bashir,
David Irwin,
Prashant Shenoy
Abstract:
Cloud platforms have been focusing on reducing their carbon emissions by shifting workloads across time and locations to when and where low-carbon energy is available. Despite the prominence of this idea, prior work has only quantified the potential of spatiotemporal workload shifting in narrow settings, i.e., for specific workloads in select regions. In particular, there has been limited work on quantifying an upper bound on the ideal and practical benefits of carbon-aware spatiotemporal workload shifting for a wide range of cloud workloads. To address the problem, we conduct a detailed data-driven analysis to understand the benefits and limitations of carbon-aware spatiotemporal scheduling for cloud workloads. We utilize carbon intensity data from 123 regions, encompassing most major cloud sites, to analyze two broad classes of workloads -- batch and interactive -- and their various characteristics, e.g., job duration, deadlines, and SLOs. Our findings show that while spatiotemporal workload shifting can reduce workloads' carbon emissions, the practical upper bounds of these carbon reductions are currently limited and far from ideal. We also show that simple scheduling policies often yield most of these reductions, with more sophisticated techniques yielding little additional benefit. Notably, we also find that the benefit of carbon-aware workload scheduling relative to carbon-agnostic scheduling will decrease as the energy supply becomes "greener".
Submitted 10 March, 2024; v1 submitted 10 June, 2023;
originally announced June 2023.
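The temporal half of the workload shifting described in the abstract above can be illustrated with a minimal sketch. It assumes an hourly carbon-intensity trace and a non-interruptible batch job with an arrival hour, a duration, and a deadline. The function names (`best_start`, `savings`) and the sample numbers are hypothetical and are not taken from the paper, which studies a much wider range of policies and workloads.

```python
from typing import Sequence, Tuple


def best_start(carbon: Sequence[float], arrival: int, duration: int,
               deadline: int) -> Tuple[int, float]:
    """Pick the lowest-carbon contiguous window for a batch job.

    carbon   -- hourly carbon intensity (e.g. gCO2/kWh), one value per hour
    arrival  -- earliest hour index at which the job may start
    duration -- job length in hours (the job is assumed non-interruptible)
    deadline -- hour index by which the job must have finished (exclusive)
    Returns (start_hour, summed_intensity_over_the_window).
    """
    if duration <= 0 or arrival + duration > deadline or deadline > len(carbon):
        raise ValueError("infeasible job window")
    best = (arrival, sum(carbon[arrival:arrival + duration]))
    # Try every later feasible start and keep the cheapest window.
    for start in range(arrival + 1, deadline - duration + 1):
        total = sum(carbon[start:start + duration])
        if total < best[1]:
            best = (start, total)
    return best


def savings(carbon: Sequence[float], arrival: int, duration: int,
            deadline: int) -> float:
    """Fractional reduction versus a carbon-agnostic run-on-arrival policy."""
    agnostic = sum(carbon[arrival:arrival + duration])
    _, aware = best_start(carbon, arrival, duration, deadline)
    return (agnostic - aware) / agnostic


if __name__ == "__main__":
    trace = [500, 450, 300, 200, 250, 400]  # made-up hourly intensities
    print(best_start(trace, arrival=0, duration=2, deadline=6))
    print(savings(trace, arrival=0, duration=2, deadline=6))
```

In this toy setting, a trace with no variation leaves nothing to gain from shifting, since every window costs the same.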
-
Deep Learning based Skin-layer Segmentation for Characterizing Cutaneous Wounds from Optical Coherence Tomography Images
Authors:
Prashant Kumar,
Swatantra Dhara,
Ayan Gope,
Jyotirmoy Chatterjee,
Subhamoy Mandal
Abstract:
Optical coherence tomography (OCT) is a medical imaging modality that allows us to probe deeper substructures of skin. The state-of-the-art wound care prediction and monitoring methods are based on visual evaluation and focus on surface information. However, research studies have shown that sub-surface information of the wound is critical for understanding the wound healing progression. This work demonstrated the use of OCT as an effective imaging tool for objective and non-invasive assessments of wound severity, the potential for healing, and healing progress by measuring the optical characteristics of skin components. We have demonstrated the efficacy of OCT in studying wound healing progress in in vivo small animal models. Automated analysis of OCT datasets poses multiple challenges, such as limitations in the training dataset size and variation in data distribution induced by uncertainties in sample quality and experiment conditions. We have employed a U-Net-based model to segment skin layers in OCT images and to study epithelial and regenerated tissue thickness and wound closure dynamics, and thus quantify the progression of wound healing. In the experimental evaluation of the OCT skin image datasets, we achieved the objective of skin layer segmentation with an average intersection over union (IOU) of 0.9234. The results have been corroborated using gold-standard histology images and co-validated using inputs from pathologists. Clinical Relevance: To monitor wound healing progression without disrupting the healing procedure by superficial, noninvasive means via the identification of pixel characteristics of individual layers.
Submitted 1 June, 2023;
originally announced June 2023.
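The intersection-over-union (IOU) metric quoted in the abstract above can be sketched as follows. This is a generic per-class IoU averaged over the classes present, written in plain Python for label maps stored as nested lists. The abstract does not state how the authors average their score, so the averaging choice and the function names (`iou_per_class`, `mean_iou`) are assumptions for illustration only.

```python
from typing import List, Optional, Sequence


def iou_per_class(pred: Sequence[Sequence[int]],
                  target: Sequence[Sequence[int]],
                  num_classes: int) -> List[Optional[float]]:
    """Per-class IoU for two label maps of identical shape.

    Each pixel holds an integer class label in [0, num_classes).
    A class absent from both maps gets None instead of a score.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: counts once in intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    return [inter[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(pred: Sequence[Sequence[int]],
             target: Sequence[Sequence[int]],
             num_classes: int) -> float:
    """Average IoU over the classes that occur in either map."""
    scores = [s for s in iou_per_class(pred, target, num_classes)
              if s is not None]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [[0, 0, 1],
                 [1, 1, 2]]
    reference = [[0, 1, 1],
                 [1, 1, 2]]
    print(iou_per_class(predicted, reference, 3))
    print(mean_iou(predicted, reference, 3))
```

An IoU of 1.0 for a class means the predicted and reference regions coincide exactly; a single mislabeled pixel lowers the score of both the predicted and the true class.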
-
Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech
Authors:
Shashi Kant Gupta,
Sushant Hiray,
Prashant Kukde
Abstract:
This work focuses on improving the Spoken Language Identification (LangId) system for a challenge that focuses on developing robust language identification systems that are reliable for non-standard, accented (Singaporean accent), spontaneous code-switched, and child-directed speech collected via Zoom. We propose a two-stage Encoder-Decoder-based E2E model. The encoder module consists of 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with a global context. The decoder module uses an attentive temporal pooling mechanism to get fixed length time-independent feature representation. The total number of parameters in the model is around 22.1 M, which is relatively light compared to using some large-scale pre-trained speech models. We achieved an EER of 15.6% in the closed track and 11.1% in the open track (baseline system 22.1%). We also curated additional LangId data from YouTube videos (having Singaporean speakers), which will be released for public use.
Submitted 1 June, 2023;
originally announced June 2023.
-
Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters
Authors:
Proyag Pal,
Brian Thompson,
Yogesh Virkar,
Prashant Mathur,
Alexandra Chronopoulou,
Marcello Federico
Abstract:
To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We also introduce auxiliary counters to help the decoder to keep track of the timing information while generating target phonemes. We show that our model improves translation quality and isochrony compared to previous work where the translation model is instead trained to predict interleaved sequences of phonemes and durations.
Submitted 22 May, 2023;
originally announced May 2023.
-
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
Authors:
Yifan Peng,
Kwangyoun Kim,
Felix Wu,
Brian Yan,
Siddhant Arora,
William Chen,
Jiyang Tang,
Suwon Shon,
Prashant Sridhar,
Shinji Watanabe
Abstract:
Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it promising for more general speech applications. This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models. Results demonstrate that E-Branchformer achieves comparable or better performance than Conformer in almost all evaluation sets across 15 ASR, 2 ST, and 3 SLU benchmarks, while being more stable during training. We will release our training configurations and pre-trained models for reproducibility, which can benefit the speech community.
Submitted 18 May, 2023;
originally announced May 2023.
-
Robust Model Predictive Techno-Economic Control of Active Distribution Networks
Authors:
Salish Maharjan,
Prashant Tiwari,
Rui Cheng,
Zhaoyu Wang
Abstract:
Stochastic controllers are perceived as a promising solution for techno-economic operation of distribution networks having higher generation uncertainties at large penetration of renewables. These controllers are supported by forecasters capable of predicting generation uncertainty by means of lower/upper bounds rather than by a probability density function (PDF). Hence, the stochastic controller assumes a suitable PDF for scenario creation and optimization, requiring validation of the assumption. To effectively bridge the forecaster's capability and resolve the assumption issues, the paper proposes a robust model prediction-based techno-economic controller, which essentially utilizes only the lower/upper bounds of the forecast, eliminating the necessity of a PDF. Both discrete and continuous control resources such as tap-changers and DERs are utilized for regulating the lower/upper bounds of the network states and robustly minimizing the cost of energy import. The proposed controller is implemented for the UKGDS network and validated by comparing performance at various confidence levels of the lower/upper bound forecast.
Submitted 5 May, 2023;
originally announced May 2023.
-
Jointly Managing Electrical and Thermal Energy in Solar- and Battery-powered Computer Systems
Authors:
Noman Bashir,
Yasra Chandio,
David Irwin,
Fatima M. Anwar,
Jeremy Gummeson,
Prashant Shenoy
Abstract:
Environmentally-powered computer systems operate on renewable energy harvested from their environment, such as solar or wind, and stored in batteries. While harvesting environmental energy has long been necessary for small-scale embedded systems without access to external power sources, it is also increasingly important in designing sustainable larger-scale systems for edge applications. For sustained operations, such systems must consider not only the electrical energy but also the thermal energy available in the environment in their design and operation. Unfortunately, prior work generally ignores the impact of thermal effects, and instead implicitly assumes ideal temperatures. To address the problem, we develop a thermodynamic model that captures the interplay of electrical and thermal energy in environmentally-powered computer systems. The model captures the effect of environmental conditions, the system's physical properties, and workload scheduling on performance. In evaluating our model, we distill the thermal effects that impact these systems using a small-scale prototype and a programmable incubator. We then leverage our model to show how considering these thermal effects in designing and operating environmentally-powered computer systems of varying scales can improve their energy-efficiency, performance, and availability.
Submitted 1 May, 2023;
originally announced May 2023.
-
Topology, dynamics, and control of an octopus-analog muscular hydrostat
Authors:
Arman Tekinalp,
Noel Naughton,
Seung-Hyun Kim,
Udit Halder,
Rhanor Gillette,
Prashant G. Mehta,
William Kier,
Mattia Gazzola
Abstract:
Muscular hydrostats, such as octopus arms or elephant trunks, lack bones entirely, endowing them with exceptional dexterity and reconfigurability. Key to their unmatched ability to control nearly infinite degrees of freedom is the architecture into which muscle fibers are woven. Their arrangement is, effectively, the instantiation of a sophisticated mechanical program that mediates, and likely facilitates, the control and realization of complex, dynamic morphological reconfigurations. Here, by combining medical imaging, biomechanical data, live behavioral experiments and numerical simulations, we synthesize a model octopus arm entailing ~200 continuous muscle groups, and begin to unravel its complexity. We show how 3D arm motions can be understood in terms of storage, transport, and conversion of topological quantities, effected by simple muscle activation templates. These, in turn, can be composed into higher-level control strategies that, compounded by the arm's compliance, are demonstrated in a range of object manipulation tasks rendered additionally challenging by the need to appropriately align suckers, to sense and grasp. Overall, our work exposes broad design and algorithmic principles pertinent to muscular hydrostats, robotics, and dynamics, while significantly advancing our ability to model muscular structures from medical imaging, with potential implications for human health and care.
Submitted 17 April, 2023;
originally announced April 2023.
-
Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition
Authors:
Philipp Klumpp,
Pooja Chitkara,
Leda Sarı,
Prashant Serai,
Jilong Wu,
Irina-Elena Veliche,
Rongqing Huang,
Qing He
Abstract:
Awareness of biased ASR datasets and models has increased notably in recent years. Even for English, despite a vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation. We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform. Furthermore, we investigate the feasibility of learned accent representations instead of static embeddings. Generated data was then used to train two state-of-the-art ASR systems. We evaluated our approach on native and non-native English datasets and found that synthetically accented data helped the ASR to better understand speech from seen accents. This observation did not translate to unseen accents, and it was not observed for a model that had been pre-trained exclusively with native speech.
Submitted 1 March, 2023;
originally announced March 2023.
-
Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding
Authors:
Yifan Peng,
Kwangyoun Kim,
Felix Wu,
Prashant Sridhar,
Shinji Watanabe
Abstract:
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation without degradation in accuracy. Prior studies focus on the pruning of Transformers; however, speech models not only utilize a stack of Transformer blocks, but also combine a frontend network based on multiple convolutional layers for low-level feature representation learning. This frontend has a small size but a heavy computational cost. In this work, we propose three task-specific structured pruning methods to deal with such heterogeneous networks. Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and is able to reduce the computation by 40% to 50% without any degradation.
Submitted 27 February, 2023;
originally announced February 2023.
-
Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing
Authors:
Alexandra Chronopoulou,
Brian Thompson,
Prashant Mathur,
Yogesh Virkar,
Surafel M. Lakew,
Marcello Federico
Abstract:
Automatic dubbing (AD) is the task of translating the original speech in a video into target language speech. The new target language speech should satisfy isochrony; that is, the new speech should be time aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation and the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
Submitted 24 February, 2023;
originally announced February 2023.
-
Hierarchical control and learning of a foraging CyberOctopus
Authors:
Chia-Hsien Shih,
Noel Naughton,
Udit Halder,
Heng-Sheng Chang,
Seung Hyun Kim,
Rhanor Gillette,
Prashant G. Mehta,
Mattia Gazzola
Abstract:
Inspired by the unique neurophysiology of the octopus, we propose a hierarchical framework that simplifies the coordination of multiple soft arms by decomposing control into high-level decision making, low-level motor activation, and local reflexive behaviors via sensory feedback. When evaluated in the illustrative problem of a model octopus foraging for food, this hierarchical decomposition results in significant improvements relative to end-to-end methods. Performance is achieved through a mixed-modes approach, whereby qualitatively different tasks are addressed via complementary control schemes. Here, model-free reinforcement learning is employed for high-level decision-making, while model-based energy shaping takes care of arm-level motor execution. To render the pairing computationally tenable, a novel neural-network energy shaping (NN-ES) controller is developed, achieving accurate motions with time-to-solutions 200 times faster than previous attempts. Our hierarchical framework is then successfully deployed in increasingly challenging foraging scenarios, including an arena littered with obstacles in 3D space, demonstrating the viability of our approach.
Submitted 11 February, 2023;
originally announced February 2023.
-
Equitable Network-Aware Decarbonization of Residential Heating at City Scale
Authors:
Adam Lechowicz,
Noman Bashir,
John Wamburu,
Mohammad Hajiesmaili,
Prashant Shenoy
Abstract:
Residential heating, primarily powered by natural gas, accounts for a significant portion of residential sector energy use and carbon emissions in many parts of the world. Hence, there is a push towards decarbonizing residential heating by transitioning to energy-efficient heat pumps powered by an increasingly greener and less carbon-intensive electric grid. However, such a transition will add load to the electric grid, triggering infrastructure upgrades, and subsequently erode the customer base using the gas distribution network. Utilities want to guide these transition efforts to ensure a phased decommissioning of the gas network and deferred electric grid infrastructure upgrades while achieving carbon reduction goals. To facilitate such a transition, we present a network-aware optimization framework for decarbonizing residential heating at city scale with an objective to maximize carbon reduction under budgetary constraints. Our approach operates on a graph representation of the gas network topology to compute the cost of transitioning and select neighborhoods for transition. We further extend our approach to explicitly incorporate equity and ensure an equitable distribution of benefits across different socioeconomic groups. We apply our framework to a city in the New England region of the U.S., using real-world gas usage, electric usage, and grid infrastructure data. We show that our network-aware strategy achieves 55% higher carbon reductions than prior network-oblivious work under the same budget. Our equity-aware strategy achieves an equitable outcome while preserving the carbon reduction benefits of the network-aware strategy.
Submitted 11 January, 2023;
originally announced January 2023.
-
A Survey of Feedback Particle Filter and related Controlled Interacting Particle Systems (CIPS)
Authors:
Amirhossein Taghvaei,
Prashant G. Mehta
Abstract:
In this survey, we describe controlled interacting particle systems (CIPS) to approximate the solution of the optimal filtering and the optimal control problems. Part I of the survey is focussed on the feedback particle filter (FPF) algorithm, its derivation based on optimal transportation theory, and its relationship to the ensemble Kalman filter (EnKF) and the conventional sequential importance sampling-resampling (SIR) particle filters. The central numerical problem of FPF -- to approximate the solution of the Poisson equation -- is described together with the main solution approaches. An analytical and numerical comparison with the SIR particle filter is given to illustrate the advantages of the CIPS approach. Part II of the survey is focussed on adapting these algorithms for the problem of reinforcement learning. The survey includes several remarks that describe extensions as well as open problems in this subject.
Submitted 20 March, 2023; v1 submitted 2 January, 2023;
originally announced January 2023.
-
Context-aware Fine-tuning of Self-supervised Speech Models
Authors:
Suwon Shon,
Felix Wu,
Kwangyoun Kim,
Prashant Sridhar,
Karen Livescu,
Shinji Watanabe
Abstract:
Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
Submitted 28 March, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
ICSPatch: Automated Vulnerability Localization and Non-Intrusive Hotpatching in Industrial Control Systems using Data Dependence Graphs
Authors:
Prashant Hari Narayan Rajput,
Constantine Doumanidis,
Michail Maniatakos
Abstract:
The paradigm shift of enabling extensive intercommunication between the Operational Technology (OT) and Information Technology (IT) devices allows vulnerabilities typical to the IT world to propagate to the OT side. Therefore, the security layer offered in the past by air gapping is removed, making security patching for OT devices a hard requirement. Conventional patching involves a device reboot to load the patched code in the main memory, which does not apply to OT devices controlling critical processes due to downtime, necessitating in-memory vulnerability patching. Furthermore, these control binaries are often compiled by in-house proprietary compilers, further hindering the patching process and placing reliance on OT vendors for rapid vulnerability discovery and patch development. The current state-of-the-art hotpatching approaches only focus on firmware and/or RTOS. Therefore, in this work, we develop ICSPatch, a framework to automate control logic vulnerability localization using Data Dependence Graphs (DDGs). With the help of DDGs, ICSPatch pinpoints the vulnerability in the control application. As an independent second step, ICSPatch can non-intrusively hotpatch vulnerabilities in the control application directly in the main memory of Programmable Logic Controllers while maintaining reliable continuous operation. To evaluate our framework, we test ICSPatch on a synthetic dataset of 24 vulnerable control application binaries from diverse critical infrastructure sectors. Results show that ICSPatch could successfully localize all vulnerabilities and generate patches accordingly. Furthermore, the patch added negligible latency increase in the execution cycle while maintaining correctness and protection against the vulnerability.
Submitted 8 December, 2022;
originally announced December 2022.