-
GenPlanX. Generation of Plans and Execution
Authors:
Daniel Borrajo,
Giuseppe Canonaco,
Tomás de la Rosa,
Alfredo Garrachón,
Sriram Gopalakrishnan,
Simerjot Kaur,
Marianela Morales,
Sunandita Patra,
Alberto Pozanco,
Keshav Ramani,
Charese Smiley,
Pietro Totis,
Manuela Veloso
Abstract:
Classical AI Planning techniques generate sequences of actions for complex tasks. However, they lack the ability to understand planning tasks when provided using natural language. The advent of Large Language Models (LLMs) has introduced novel capabilities in human-computer interaction. In the context of planning tasks, LLMs have shown to be particularly good in interpreting human intents among ot…
▽ More
Classical AI Planning techniques generate sequences of actions for complex tasks. However, they lack the ability to understand planning tasks when provided using natural language. The advent of Large Language Models (LLMs) has introduced novel capabilities in human-computer interaction. In the context of planning tasks, LLMs have shown to be particularly good in interpreting human intents among other uses. This paper introduces GenPlanX that integrates LLMs for natural language-based description of planning tasks, with a classical AI planning engine, alongside an execution and monitoring framework. We demonstrate the efficacy of GenPlanX in assisting users with office-related tasks, highlighting its potential to streamline workflows and enhance productivity through seamless human-AI collaboration.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages
Authors:
Utkarsh Pathak,
Chandra Sai Krishna Gunda,
Anusha Prakash,
Keshav Agarwal,
Hema A. Murthy
Abstract:
Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, with 22 official using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from diff…
▽ More
Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, with 22 official using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work is in the augmentation of a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Exemplifying Emerging Phishing: QR-based Browser-in-The-Browser (BiTB) Attack
Authors:
Muhammad Wahid Akram,
Keshav Sood,
Muneeb Ul Hassan,
Basant Subba
Abstract:
Lately, cybercriminals constantly formulate productive approaches to exploit individuals. This article exemplifies an innovative attack, namely QR-based Browser-in-The-Browser (BiTB), using proficiencies of Large Language Model (LLM) i.e. Google Gemini. The presented attack is a fusion of two emerging attacks: BiTB and Quishing (QR code phishing). Our study underscores attack's simplistic implemen…
▽ More
Lately, cybercriminals constantly formulate productive approaches to exploit individuals. This article exemplifies an innovative attack, namely QR-based Browser-in-The-Browser (BiTB), using proficiencies of Large Language Model (LLM) i.e. Google Gemini. The presented attack is a fusion of two emerging attacks: BiTB and Quishing (QR code phishing). Our study underscores attack's simplistic implementation utilizing malicious prompts provided to Gemini-LLM. Moreover, we presented a case study to highlight a lucrative attack method, we also performed an experiment to comprehend the attack execution on victims' device. The findings of this work obligate the researchers' contributions in confronting this type of phishing attempts through LLMs.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Conformal Language Model Reasoning with Coherent Factuality
Authors:
Maxon Rubin-Toles,
Maya Gambhir,
Keshav Ramji,
Aaron Roth,
Surbhi Goel
Abstract:
Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, wh…
▽ More
Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define "coherent factuality" and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a "deducibility" graph" that represents the steps of a reasoning problem. We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. Moreover, we achieve 90% factuality on our stricter definition while retaining 80% or more of the original claims, highlighting the utility of our deducibility-graph-guided approach.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Latent Principle Discovery for Language Model Self-Improvement
Authors:
Keshav Ramji,
Tahira Naseem,
Ramón Fernandez Astudillo
Abstract:
When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towar…
▽ More
When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
A Framework to Prevent Biometric Data Leakage in the Immersive Technologies Domain
Authors:
Keshav Sood,
Iynkaran Natgunanathan,
Uthayasanker Thayasivam,
Vithurabiman Senthuran,
Xiaoning Zhang,
Shui Yu
Abstract:
Doubtlessly, the immersive technologies have potential to ease people's life and uplift economy, however the obvious data privacy risks cannot be ignored. For example, a participant wears a 3D headset device which detects participant's head motion to track the pose of participant's head to match the orientation of camera with participant's eyes positions in the real-world. In a preliminary study,…
▽ More
Doubtlessly, the immersive technologies have potential to ease people's life and uplift economy, however the obvious data privacy risks cannot be ignored. For example, a participant wears a 3D headset device which detects participant's head motion to track the pose of participant's head to match the orientation of camera with participant's eyes positions in the real-world. In a preliminary study, researchers have proved that the voice command features on such headsets could lead to major privacy leakages. By analyzing the facial dynamics captured with the motion sensors, the headsets suffer security vulnerabilities revealing a user's sensitive speech without user's consent. The psychography data (such as voice command features, facial dynamics, etc.) is sensitive data and it should not be leaked out of the device without users consent else it is a privacy breach. To the best of our literature review, the work done in this particular research problem is very limited. Motivated from this, we develop a simple technical framework to mitigate sensitive data (or biometric data) privacy leaks in immersive technology domain. The performance evaluation is conducted in a robust way using six data sets, to show that the proposed solution is effective and feasible to prevent this issue.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Mitigating Backdoor Triggered and Targeted Data Poisoning Attacks in Voice Authentication Systems
Authors:
Alireza Mohammadi,
Keshav Sood,
Dhananjay Thiruvady,
Asef Nazari
Abstract:
Voice authentication systems remain susceptible to two major threats: backdoor triggered attacks and targeted data poisoning attacks. This dual vulnerability is critical because conventional solutions typically address each threat type separately, leaving systems exposed to adversaries who can exploit both attacks simultaneously. We propose a unified defense framework that effectively addresses bo…
▽ More
Voice authentication systems remain susceptible to two major threats: backdoor triggered attacks and targeted data poisoning attacks. This dual vulnerability is critical because conventional solutions typically address each threat type separately, leaving systems exposed to adversaries who can exploit both attacks simultaneously. We propose a unified defense framework that effectively addresses both BTA and TDPA. Our framework integrates a frequency focused detection mechanism that flags covert pitch boosting and sound masking backdoor attacks in near real time, followed by a convolutional neural network that addresses TDPA. This dual layered defense approach utilizes multidimensional acoustic features to isolate anomalous signals without requiring costly model retraining. In particular, our PBSM detection mechanism can seamlessly integrate into existing voice authentication pipelines and scale effectively for large scale deployments. Experimental results on benchmark datasets and their compression with the state of the art algorithm demonstrate that our PBSM detection mechanism outperforms the state of the art. Our framework reduces attack success rates to as low as five to fifteen percent while maintaining a recall rate of up to ninety five percent in recognizing TDPA.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring
Authors:
Kaili Huang,
Thejas Venkatesh,
Uma Dingankar,
Antonio Mallia,
Daniel Campos,
Jian Jiao,
Christopher Potts,
Matei Zaharia,
Kwabena Boahen,
Omar Khattab,
Saarthak Sarup,
Keshav Santhanam
Abstract:
We study serving retrieval models, specifically late interaction models like ColBERT, to many concurrent users at once and under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers, and incorporates a multi-stag…
▽ More
We study serving retrieval models, specifically late interaction models like ColBERT, to many concurrent users at once and under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers, and incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Evolutionary Policy Optimization
Authors:
Zelal Su "Lain" Mustafaoglu,
Keshav Pingali,
Risto Miikkulainen
Abstract:
A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack m…
▽ More
A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack mechanisms for exploitation. To address these limitations, this paper proposes Evolutionary Policy Optimization (EPO), a hybrid algorithm that integrates neuroevolution with policy gradient methods for policy optimization. EPO leverages the exploration capabilities of EC and the exploitation strengths of PG, offering an efficient solution to the exploration-exploitation dilemma in RL. EPO is evaluated on the Atari Pong and Breakout benchmarks. Experimental results show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, making it effective for tasks that require both exploration and local optimization.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Boosting Reservoir Computing with Brain-inspired Adaptive Dynamics
Authors:
Keshav Srinivasan,
Dietmar Plenz,
Michelle Girvan
Abstract:
Reservoir computers (RCs) provide a computationally efficient alternative to deep learning while also offering a framework for incorporating brain-inspired computational principles. By using an internal neural network with random, fixed connections$-$the 'reservoir'$-$and training only the output weights, RCs simplify the training process but remain sensitive to the choice of hyperparameters that…
▽ More
Reservoir computers (RCs) provide a computationally efficient alternative to deep learning while also offering a framework for incorporating brain-inspired computational principles. By using an internal neural network with random, fixed connections$-$the 'reservoir'$-$and training only the output weights, RCs simplify the training process but remain sensitive to the choice of hyperparameters that govern activation functions and network architecture. Moreover, typical RC implementations overlook a critical aspect of neuronal dynamics: the balance between excitatory and inhibitory (E-I) signals, which is essential for robust brain function. We show that RCs characteristically perform best in balanced or slightly over-inhibited regimes, outperforming excitation-dominated ones. To reduce the need for precise hyperparameter tuning, we introduce a self-adapting mechanism that locally adjusts E/I balance to achieve target neuronal firing rates, improving performance by up to 130% in tasks like memory capacity and time series prediction compared with globally tuned RCs. Incorporating brain-inspired heterogeneity in target neuronal firing rates further reduces the need for fine-tuning hyperparameters and enables RCs to excel across linear and non-linear tasks. These results support a shift from static optimization to dynamic adaptation in reservoir design, demonstrating how brain-inspired mechanisms improve RC performance and robustness while deepening our understanding of neural computation.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Authors:
NVIDIA,
:,
Aaron Blakeman,
Aarti Basant,
Abhinav Khattar,
Adithya Renduchintala,
Akhiad Bercovich,
Aleksander Ficek,
Alexis Bjorlin,
Ali Taghibakhshi,
Amala Sanjay Deshmukh,
Ameya Sunil Mahabaleshwarkar,
Andrew Tao,
Anna Shors,
Ashwath Aithal,
Ashwin Poojary,
Ayush Dattagupta,
Balaram Buddharaju,
Bobby Chen,
Boris Ginsburg,
Boxin Wang,
Brandon Norick,
Brian Butterfield,
Bryan Catanzaro,
Carlo del Mundo
, et al. (176 additional authors not shown)
Abstract:
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transf…
▽ More
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
△ Less
Submitted 15 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Spectral Normalization and Voigt-Reuss net: A universal approach to microstructure-property forecasting with physical guarantees
Authors:
Sanath Keshav,
Julius Herb,
Felix Fritzen
Abstract:
Heterogeneous materials are crucial to producing lightweight components, functional components, and structures composed of them. A crucial step in the design process is the rapid evaluation of their effective mechanical, thermal, or, in general, constitutive properties. The established procedure is to use forward models that accept microstructure geometry and local constitutive properties as input…
▽ More
Heterogeneous materials are crucial to producing lightweight components, functional components, and structures composed of them. A crucial step in the design process is the rapid evaluation of their effective mechanical, thermal, or, in general, constitutive properties. The established procedure is to use forward models that accept microstructure geometry and local constitutive properties as inputs. The classical simulation-based approach, which uses, e.g., finite elements and FFT-based solvers, can require substantial computational resources. At the same time, simulation-based models struggle to provide gradients with respect to the microstructure and the constitutive parameters. Such gradients are, however, of paramount importance for microstructure design and for inverting the microstructure-property mapping. Machine learning surrogates can excel in these situations. However, they can lead to unphysical predictions that violate essential bounds on the constitutive response, such as the upper (Voigt-like) or the lower (Reuss-like) bound in linear elasticity. Therefore, we propose a novel spectral normalization scheme that a priori enforces these bounds. The approach is fully agnostic with respect to the chosen microstructural features and the utilized surrogate model. All of these will automatically and strictly predict outputs that obey the upper and lower bounds by construction. The technique can be used for any constitutive tensor that is symmetric and where upper and lower bounds (in the Löwner sense) exist, i.e., for permeability, thermal conductivity, linear elasticity, and many more. We demonstrate the use of spectral normalization in the Voigt-Reuss net using a simple neural network. Numerical examples on truly extensive datasets illustrate the improved accuracy, robustness, and independence of the type of input features in comparison to much-used neural networks.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
AutoML Algorithms for Online Generalized Additive Model Selection: Application to Electricity Demand Forecasting
Authors:
Keshav Das,
Julie Keisler,
Margaux Brégère,
Amaury Durand
Abstract:
Electricity demand forecasting is key to ensuring that supply meets demand lest the grid would blackout. Reliable short-term forecasts may be obtained by combining a Generalized Additive Models (GAM) with a State-Space model (Obst et al., 2021), leading to an adaptive (or online) model. A GAM is an over-parameterized linear model defined by a formula and a state-space model involves hyperparameter…
▽ More
Electricity demand forecasting is key to ensuring that supply meets demand lest the grid would blackout. Reliable short-term forecasts may be obtained by combining a Generalized Additive Models (GAM) with a State-Space model (Obst et al., 2021), leading to an adaptive (or online) model. A GAM is an over-parameterized linear model defined by a formula and a state-space model involves hyperparameters. Both the formula and adaptation parameters have to be fixed before model training and have a huge impact on the model's predictive performance. We propose optimizing them using the DRAGON package of Keisler (2025), originally designed for neural architecture search. This work generalizes it for automated online generalized additive model selection by defining an efficient modeling of the search space (namely, the space of the GAM formulae and adaptation parameters). Its application to short-term French electricity demand forecasting demonstrates the relevance of the approach
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews
Authors:
Huimin Xu,
Seungjun Yi,
Terence Lim,
Jiawei Xu,
Andrew Well,
Carlos Mery,
Aidong Zhang,
Yuji Zhang,
Heng Ji,
Keshav Pingali,
Yan Leng,
Ying Ding
Abstract:
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-A…
▽ More
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration by enhancing quality while significantly reducing manual workload.
△ Less
Submitted 26 March, 2025;
originally announced March 2025.
-
Vision-Guided Loco-Manipulation with a Snake Robot
Authors:
Adarsh Salagame,
Sasank Potluri,
Keshav Bharadwaj Vaidyanathan,
Kruthika Gangaraju,
Eric Sihite,
Milad Ramezani,
Alireza Ramezani
Abstract:
This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University's snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-…
▽ More
This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University's snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-manipulation for transporting objects to specified goal locations. Additionally, we demonstrate open-loop experiments in which COBRA successfully performs real-time object detection and loco-manipulation tasks.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
Neural reservoir control of a soft bio-hybrid arm
Authors:
Noel Naughton,
Arman Tekinalp,
Keshav Shivam,
Seung Hung Kim,
Volodymyr Kindratenko,
Mattia Gazzola
Abstract:
A long-standing engineering problem, the control of soft robots is difficult because of their highly non-linear, heterogeneous, anisotropic, and distributed nature. Here, bridging engineering and biology, a neural reservoir is employed for the dynamic control of a bio-hybrid model arm made of multiple muscle-tendon groups enveloping an elastic spine. We show how the use of reservoirs facilitates s…
▽ More
A long-standing engineering problem, the control of soft robots is difficult because of their highly non-linear, heterogeneous, anisotropic, and distributed nature. Here, bridging engineering and biology, a neural reservoir is employed for the dynamic control of a bio-hybrid model arm made of multiple muscle-tendon groups enveloping an elastic spine. We show how the use of reservoirs facilitates simultaneous control and self-modeling across a set of challenging tasks, outperforming classic neural network approaches. Further, by implementing a spiking reservoir on neuromorphic hardware, energy efficiency is achieved, with nearly two-orders of magnitude improvement relative to standard CPUs, with implications for the on-board control of untethered, small-scale soft robots.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
DashCop: Automated E-ticket Generation for Two-Wheeler Traffic Violations Using Dashcam Videos
Authors:
Deepti Rawat,
Keshav Gupta,
Aryamaan Basu Roy,
Ravi Kiran Sarvadevabhatla
Abstract:
Motorized two-wheelers are a prevalent and economical means of transportation, particularly in the Asia-Pacific region. However, hazardous driving practices such as triple riding and non-compliance with helmet regulations contribute significantly to accident rates. Addressing these violations through automated enforcement mechanisms can enhance traffic safety. In this paper, we propose DashCop, an…
▽ More
Motorized two-wheelers are a prevalent and economical means of transportation, particularly in the Asia-Pacific region. However, hazardous driving practices such as triple riding and non-compliance with helmet regulations contribute significantly to accident rates. Addressing these violations through automated enforcement mechanisms can enhance traffic safety. In this paper, we propose DashCop, an end-to-end system for automated E-ticket generation. The system processes vehicle-mounted dashcam videos to detect two-wheeler traffic violations. Our contributions include: (1) a novel Segmentation and Cross-Association (SAC) module to accurately associate riders with their motorcycles, (2) a robust cross-association-based tracking algorithm optimized for the simultaneous presence of riders and motorcycles, and (3) the RideSafe-400 dataset, a comprehensive annotated dashcam video dataset for triple riding and helmet rule violations. Our system demonstrates significant improvements in violation detection, validated through extensive evaluations on the RideSafe-400 dataset.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement
Authors:
Keshav Bhandari,
Sungkyun Chang,
Tongyu Lu,
Fareza R. Enus,
Louis B. Bradshaw,
Dorien Herremans,
Simon Colton
Abstract:
Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks.…
▽ More
Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. The improvisational style transfer is aimed at making meaningful modifications to one or more musical elements - melody, harmony or rhythm of the original composition with respect to the target genre. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79\% of participants correctly identifying jazz-style improvisations of classical pieces. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
△ Less
Submitted 16 May, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
Advanced Architectures Integrated with Agentic AI for Next-Generation Wireless Networks
Authors:
Kapal Dev,
Sunder Ali Khowaja,
Keshav Singh,
Engin Zeydan,
Merouane Debbah
Abstract:
This paper investigates a range of cutting-edge technologies and architectural innovations aimed at simplifying network operations, reducing operational expenditure (OpEx), and enabling the deployment of new service models. The focus is on (i) Proposing novel, more efficient 6G architectures, with both Control and User planes enabling the seamless expansion of services, while addressing long-term…
▽ More
This paper investigates a range of cutting-edge technologies and architectural innovations aimed at simplifying network operations, reducing operational expenditure (OpEx), and enabling the deployment of new service models. The focus is on (i) Proposing novel, more efficient 6G architectures, with both Control and User planes enabling the seamless expansion of services, while addressing long-term 6G network evolution. (ii) Exploring advanced techniques for constrained artificial intelligence (AI) operations, particularly the design of AI agents for real-time learning, optimizing energy consumption, and the allocation of computational resources. (iii) Identifying technologies and architectures that support the orchestration of backend services using serverless computing models across multiple domains, particularly for vertical industries. (iv) Introducing optically-based, ultra-high-speed, low-latency network architectures, with fast optical switching and real-time control, replacing conventional electronic switching to reduce power consumption by an order of magnitude.
△ Less
Submitted 15 April, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Simultaneous Estimation of Manipulation Skill and Hand Grasp Force from Forearm Ultrasound Images
Authors:
Keshav Bimbraw,
Srikar Nekkanti,
Daniel B. Tiller II,
Mihir Deshmukh,
Berk Calli,
Robert D. Howe,
Haichong K. Zhang
Abstract:
Accurate estimation of human hand configuration and the forces they exert is critical for effective teleoperation and skill transfer in robotic manipulation. A deeper understanding of human interactions with objects can further enhance teleoperation performance. To address this need, researchers have explored methods to capture and translate human manipulation skills and applied forces to robotic…
▽ More
Accurate estimation of human hand configuration and the forces they exert is critical for effective teleoperation and skill transfer in robotic manipulation. A deeper understanding of human interactions with objects can further enhance teleoperation performance. To address this need, researchers have explored methods to capture and translate human manipulation skills and applied forces to robotic systems. Among these, biosignal-based approaches, particularly those using forearm ultrasound data, have shown significant potential for estimating hand movements and finger forces. In this study, we present a method for simultaneously estimating manipulation skills and applied hand force using forearm ultrasound data. Data collected from seven participants were used to train deep learning models for classifying manipulation skills and estimating grasp force. Our models achieved an average classification accuracy of 94.87 percent plus or minus 10.16 percent for manipulation skills and an average root mean square error (RMSE) of 0.51 plus or minus 0.19 Newtons for force estimation, as evaluated using five-fold cross-validation. These results highlight the effectiveness of forearm ultrasound in advancing human-machine interfacing and robotic teleoperation for complex manipulation tasks. This work enables new and effective possibilities for human-robot skill transfer and tele-manipulation, bridging the gap between human dexterity and robotic control.
△ Less
Submitted 31 January, 2025;
originally announced February 2025.
-
Yin-Yang: Developing Motifs With Long-Term Structure And Controllability
Authors:
Keshav Bhandari,
Geraint A. Wiggins,
Simon Colton
Abstract:
Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One of the reasons for this challenge is due to the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviations from t…
▽ More
Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One of the reasons for this challenge is due to the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviations from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of a phrase generator, phrase refiner, and phrase selector models for the development of motifs into melodies with long-term structure and controllability. The phrase refiner is trained on a novel corruption-refinement strategy which allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric for quantifying how smoothly the motif manifests itself within the piece. Evaluation results show that our model achieves better performance compared to state-of-the-art transformer models while having the advantage of being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at https://github.com/keshavbhandari/yinyang.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Figurative-cum-Commonsense Knowledge Infusion for Multimodal Mental Health Meme Classification
Authors:
Abdullah Mazhar,
Zuhair hasan shaik,
Aseem Srivastava,
Polly Ruhnke,
Lavanya Vaddavalli,
Sri Keshav Katragadda,
Shweta Yadav,
Md Shad Akhtar
Abstract:
The expression of mental health symptoms through non-traditional means, such as memes, has gained remarkable attention over the past few years, with users often highlighting their mental health struggles through figurative intricacies within memes. While humans rely on commonsense knowledge to interpret these complex expressions, current Multimodal Language Models (MLMs) struggle to capture these…
▽ More
The expression of mental health symptoms through non-traditional means, such as memes, has gained remarkable attention over the past few years, with users often highlighting their mental health struggles through figurative intricacies within memes. While humans rely on commonsense knowledge to interpret these complex expressions, current Multimodal Language Models (MLMs) struggle to capture these figurative aspects inherent in memes. To address this gap, we introduce a novel dataset, AxiOM, derived from the GAD anxiety questionnaire, which categorizes memes into six fine-grained anxiety symptoms. Next, we propose a commonsense and domain-enriched framework, M3H, to enhance MLMs' ability to interpret figurative language and commonsense knowledge. The overarching goal remains to first understand and then classify the mental health symptoms expressed in memes. We benchmark M3H against 6 competitive baselines (with 20 variations), demonstrating improvements in both quantitative and qualitative metrics, including a detailed human evaluation. We observe a clear improvement of 4.20% and 4.66% on weighted-F1 metric. To assess the generalizability, we perform extensive experiments on a public dataset, RESTORE, for depressive symptom identification, presenting an extensive ablation study that highlights the contribution of each module in both datasets. Our findings reveal limitations in existing models and the advantage of employing commonsense to enhance figurative understanding.
△ Less
Submitted 25 January, 2025;
originally announced January 2025.
-
Hypercone Assisted Contour Generation for Out-of-Distribution Detection
Authors:
Annita Vapsi,
Andrés Muñoz,
Nancy Thomas,
Keshav Ramani,
Daniel Borrajo
Abstract:
Recent advances in the field of out-of-distribution (OOD) detection have placed great emphasis on learning better representations suited to this task. While there are distance-based approaches, distributional awareness has seldom been exploited for better performance. We present HAC$_k$-OOD, a novel OOD detection method that makes no distributional assumption about the data, but automatically adap…
▽ More
Recent advances in the field of out-of-distribution (OOD) detection have placed great emphasis on learning better representations suited to this task. While there are distance-based approaches, distributional awareness has seldom been exploited for better performance. We present HAC$_k$-OOD, a novel OOD detection method that makes no distributional assumption about the data, but automatically adapts to its distribution. Specifically, HAC$_k$-OOD constructs a set of hypercones by maximizing the angular distance to neighbors in a given data-point's vicinity to approximate the contour within which in-distribution (ID) data-points lie. Experimental results show state-of-the-art FPR@95 and AUROC performance on Near-OOD detection and on Far-OOD detection on the challenging CIFAR-100 benchmark without explicitly training for OOD performance.
△ Less
Submitted 17 February, 2025; v1 submitted 17 January, 2025;
originally announced January 2025.
-
HyperQuery: Beyond Binary Link Prediction
Authors:
Sepideh Maleki,
Josh Vekhter,
Keshav Pingali
Abstract:
Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such higher order relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus…
▽ More
Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such higher order relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as simple hypergraphs and develop a novel, simple, and effective optimization architecture that addresses both tasks. Additionally, we introduce a novel feature extraction technique using node level clustering and we show how integrating data from node-level labels can improve system performance. Our self-supervised approach achieves significant improvement over state of the art baselines on several hyperedge prediction and knowledge hypergraph completion benchmarks.
△ Less
Submitted 13 January, 2025;
originally announced January 2025.
-
Inductive Linguistic Reasoning with Large Language Models
Authors:
Raghav Ramji,
Keshav Ramji
Abstract:
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve…
▽ More
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars, through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. These gains are attributable to the analogical demonstrations, both when self-generated as well as when produced by weaker multilingual models. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving sizable improvements across all problem types and difficulty levels included in the LINGOLY dataset with GPT-4o. We also report several findings about interesting phenomena which drive linguistic reasoning performance, suggesting that such puzzles are a valuable benchmark for new reasoning methods.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
Popularity Estimation and New Bundle Generation using Content and Context based Embeddings
Authors:
Ashutosh Nayak,
Prajwal NJ,
Sameeksha Keshav,
Kavitha S. N.,
Roja Reddy,
Rajasekhara Reddy Duvvuru Muni
Abstract:
Recommender systems create enormous value for businesses and their consumers. They increase revenue for businesses while improving the consumer experience by recommending relevant products amidst huge product base. Product bundling is an exciting development in the field of product recommendations. It aims at generating new bundles and recommending exciting and relevant bundles to their consumers.…
▽ More
Recommender systems create enormous value for businesses and their consumers. They increase revenue for businesses while improving the consumer experience by recommending relevant products amidst huge product base. Product bundling is an exciting development in the field of product recommendations. It aims at generating new bundles and recommending exciting and relevant bundles to their consumers. Unlike traditional recommender systems that recommend single items to consumers, product bundling aims at targeting a bundle, or a set of items, to the consumers. While bundle recommendation has attracted significant research interest recently, extant literature on bundle generation is scarce. Moreover, metrics to identify if a bundle is popular or not is not well studied. In this work, we aim to fulfill this gap by introducing new bundle popularity metrics based on sales, consumer experience and item diversity in a bundle. We use these metrics in the methodology proposed in this paper to generate new bundles for mobile games using content aware and context aware embeddings. We use opensource Steam Games dataset for our analysis. Our experiments indicate that we can generate new bundles that can outperform the existing bundles on the popularity metrics by 32% - 44%. Our experiments are computationally efficient and the proposed methodology is generic that can be extended to other bundling problems e.g. product bundling, music bundling.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.
-
Text2midi: Generating Symbolic Music from Captions
Authors:
Keshav Bhandari,
Abhinaba Roy,
Kyra Wang,
Geeta Puri,
Simon Colton,
Dorien Herremans
Abstract:
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Speci…
▽ More
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (https://github.com/AMAAI-Lab/Text2midi) for users to interact with text2midi.
△ Less
Submitted 31 December, 2024; v1 submitted 21 December, 2024;
originally announced December 2024.
-
LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis
Authors:
Rabimba Karanjai,
Yang Lu,
Dana Alsagheer,
Keshav Kasichainula,
Lei Xu,
Weidong Shi,
Shou-Hsuan Stephen Huang
Abstract:
Logs are critical resources that record events, activities, or messages produced by software applications, operating systems, servers, and network devices. However, consolidating the heterogeneous logs and cross-referencing them is challenging and complicated. Manually analyzing the log data is time-consuming and prone to errors. LogBabylon is a centralized log data consolidating solution that lev…
▽ More
Logs are critical resources that record events, activities, or messages produced by software applications, operating systems, servers, and network devices. However, consolidating the heterogeneous logs and cross-referencing them is challenging and complicated. Manually analyzing the log data is time-consuming and prone to errors. LogBabylon is a centralized log data consolidating solution that leverages Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) technology. LogBabylon interprets the log data in a human-readable way and adds insight analysis of the system performance and anomaly alerts. It provides a paramount view of the system landscape, enabling proactive management and rapid incident response. LogBabylon consolidates diverse log sources and enhances the extracted information's accuracy and relevancy. This facilitates a deeper understanding of log data, supporting more effective decision-making and operational efficiency. Furthermore, LogBabylon streamlines the log analysis process, significantly reducing the time and effort required to interpret complex datasets. Its capabilities extend to generating context-aware insights, offering an invaluable tool for continuous monitoring, performance optimization, and security assurance in dynamic computing environments.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
SlideSpawn: An Automatic Slides Generation System for Research Publications
Authors:
Keshav Kumar,
Ravindranath Chowdary
Abstract:
Research papers are well structured documents. They have text, figures, equations, tables etc., to covey their ideas and findings. They are divided into sections like Introduction, Model, Experiments etc., which deal with different aspects of research. Characteristics like these set research papers apart from ordinary documents and allows us to significantly improve their summarization. In this pa…
▽ More
Research papers are well structured documents. They have text, figures, equations, tables etc., to covey their ideas and findings. They are divided into sections like Introduction, Model, Experiments etc., which deal with different aspects of research. Characteristics like these set research papers apart from ordinary documents and allows us to significantly improve their summarization. In this paper, we propose a novel system, SlideSpwan, that takes PDF of a research document as an input and generates a quality presentation providing it's summary in a visual and concise fashion. The system first converts the PDF of the paper to an XML document that has the structural information about various elements. Then a machine learning model, trained on PS5K dataset and Aminer 9.5K Insights dataset (that we introduce), is used to predict salience of each sentence in the paper. Sentences for slides are selected using ILP and clustered based on their similarity with each cluster being given a suitable title. Finally a slide is generated by placing any graphical element referenced in the selected sentences next to them. Experiments on a test set of 650 pairs of papers and slides demonstrate that our system generates presentations with better quality.
△ Less
Submitted 20 November, 2024;
originally announced November 2024.
-
Optimal Capacity Modification for Stable Matchings with Ties
Authors:
Keshav Ranjan,
Meghana Nasre,
Prajakta Nimbhorkar
Abstract:
We consider the Hospitals/Residents (HR) problem in the presence of ties in preference lists. Among the three notions of stability, viz. weak, strong, and super stability, we focus on the notion of strong stability. Strong stability has many desirable properties, both theoretically and practically; however, its existence is not guaranteed. In this paper, our objective is to optimally increase the…
▽ More
We consider the Hospitals/Residents (HR) problem in the presence of ties in preference lists. Among the three notions of stability, viz. weak, strong, and super stability, we focus on the notion of strong stability. Strong stability has many desirable properties, both theoretically and practically; however, its existence is not guaranteed. In this paper, our objective is to optimally increase the quotas of hospitals to ensure that a strongly stable matching exists in the modified instance. First, we show that if ties are allowed in residents' preference lists, it may not be possible to augment the hospital quotas to obtain an instance that admits a strongly stable matching. When residents' preference lists are strict, we explore two natural optimization criteria: (i) minimizing the total capacity increase across all hospitals (MINSUM) and (ii) minimizing the maximum capacity increase for any hospital (MINMAX). We show that the MINSUM problem admits a polynomial-time algorithm, whereas the MINMAX problem is NP-hard. We prove an analogue of the Rural Hospitals theorem for the MINSUM problem. When each hospital incurs a cost for a unit increase in its quota, the MINSUM problem becomes NP-hard, even for 0/1 costs. In fact, we show that the problem cannot be approximated to any multiplicative factor. We also present a polynomial-time algorithm for optimal MINSUM augmentation when a specified subset of edges is required to be included in the matching. We show that the MINMAX problem is NP-hard in general. When hospital preference lists have ties of length at most $\ell+1$, we give a polynomial-time algorithm that increases each hospital's quota by at most $\ell$. Amongst all instances obtained by at most $\ell$ augmentations per hospital, our algorithm produces a strongly stable matching that is best for residents.
△ Less
Submitted 23 May, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
LLC Intra-set Write Balancing
Authors:
Keshav Krishna,
Ayush Verma
Abstract:
The increasing use of Non-Volatile Memory (NVM) in computer architecture has brought about new challenges, one of which is the write endurance problem. Frequent writes to a particular cache cell in NVM can lead to degradation of the memory cell and reduce its lifespan. To solve this problem, we propose a sample-based blocking technique for the Last Level Cache (LLC). Our approach involves defining…
▽ More
The increasing use of Non-Volatile Memory (NVM) in computer architecture has brought about new challenges, one of which is the write endurance problem. Frequent writes to a particular cache cell in NVM can lead to degradation of the memory cell and reduce its lifespan. To solve this problem, we propose a sample-based blocking technique for the Last Level Cache (LLC). Our approach involves defining a threshold value and sampling a subset of cache sets. If the number of writes to a way in a sampled set exceeds the threshold, the way is blocked, and writes are redirected to other ways. We also maintain a history structure to record the number of writes in a set and a PC-Table to use for blocking in unsampled sets. Based on blocking on sampled sets, variance of values stored in history is used to determine whether blocking had a positive impact or not, and on this basis, value corresponding to instruction pointer is incremented or decremented. This value is later used for blocking in unsampled sets. Our results show that our approach significantly balances write traffic to the cache and improves the overall lifespan of the memory cells while having better performance to the base-line system. Our approach can also be applied to other cache hierarchies and NVM technologies to mitigate the problem of write endurance.
△ Less
Submitted 20 October, 2024;
originally announced October 2024.
-
Hand Gesture Classification Based on Forearm Ultrasound Video Snippets Using 3D Convolutional Neural Networks
Authors:
Keshav Bimbraw,
Ankit Talele,
Haichong K. Zhang
Abstract:
Ultrasound based hand movement estimation is a crucial area of research with applications in human-machine interaction. Forearm ultrasound offers detailed information about muscle morphology changes during hand movement which can be used to estimate hand gestures. Previous work has focused on analyzing 2-Dimensional (2D) ultrasound image frames using techniques such as convolutional neural network…
▽ More
Ultrasound based hand movement estimation is a crucial area of research with applications in human-machine interaction. Forearm ultrasound offers detailed information about muscle morphology changes during hand movement which can be used to estimate hand gestures. Previous work has focused on analyzing 2-Dimensional (2D) ultrasound image frames using techniques such as convolutional neural networks (CNNs). However, such 2D techniques do not capture temporal features from segments of ultrasound data corresponding to continuous hand movements. This study uses 3D CNN based techniques to capture spatio-temporal patterns within ultrasound video segments for gesture recognition. We compared the performance of a 2D convolution-based network with (2+1)D convolution-based, 3D convolution-based, and our proposed network. Our methodology enhanced the gesture classification accuracy to 98.8 +/- 0.9%, from 96.5 +/- 2.3% compared to a network trained with 2D convolution layers. These results demonstrate the advantages of using ultrasound video snippets for improving hand gesture classification performance.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
Improving Intersession Reproducibility for Forearm Ultrasound based Hand Gesture Classification through an Incremental Learning Approach
Authors:
Keshav Bimbraw,
Jack Rothenberg,
Haichong K. Zhang
Abstract:
Ultrasound images of the forearm can be used to classify hand gestures towards developing human machine interfaces. In our previous work, we have demonstrated gesture classification using ultrasound on a single subject without removing the probe before evaluation. This has limitations in usage as once the probe is removed and replaced, the accuracy declines since the classifier performance is sens…
▽ More
Ultrasound images of the forearm can be used to classify hand gestures towards developing human machine interfaces. In our previous work, we have demonstrated gesture classification using ultrasound on a single subject without removing the probe before evaluation. This has limitations in usage as once the probe is removed and replaced, the accuracy declines since the classifier performance is sensitive to the probe location on the arm. In this paper, we propose training a model on multiple data collection sessions to create a generalized model, utilizing incremental learning through fine tuning. Ultrasound data was acquired for 5 hand gestures within a session (without removing and putting the probe back on) and across sessions. A convolutional neural network (CNN) with 5 cascaded convolution layers was used for this study. A pre-trained CNN was fine tuned with the convolution blocks acting as a feature extractor, and the parameters of the remaining layers updated in an incremental fashion. Fine tuning was done using different session splits within a session and between multiple sessions. We found that incremental fine tuning can help enhance classification accuracy with more fine tuning sessions. After 2 fine tuning sessions for each experiment, we found an approximate 10% increase in classification accuracy. This work demonstrates that incremental learning through fine tuning on ultrasound based hand gesture classification can be used improves accuracy while saving storage, processing power, and time. It can be expanded to generalize between multiple subjects and towards developing personalized wearable devices.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
Forearm Ultrasound based Gesture Recognition on Edge
Authors:
Keshav Bimbraw,
Haichong K. Zhang,
Bashima Islam
Abstract:
Ultrasound imaging of the forearm has demonstrated significant potential for accurate hand gesture classification. Despite this progress, there has been limited focus on developing a stand-alone end- to-end gesture recognition system which makes it mobile, real-time and more user friendly. To bridge this gap, this paper explores the deployment of deep neural networks for forearm ultrasound-based h…
▽ More
Ultrasound imaging of the forearm has demonstrated significant potential for accurate hand gesture classification. Despite this progress, there has been limited focus on developing a stand-alone end- to-end gesture recognition system which makes it mobile, real-time and more user friendly. To bridge this gap, this paper explores the deployment of deep neural networks for forearm ultrasound-based hand gesture recognition on edge devices. Utilizing quantization techniques, we achieve substantial reductions in model size while maintaining high accuracy and low latency. Our best model, with Float16 quantization, achieves a test accuracy of 92% and an inference time of 0.31 seconds on a Raspberry Pi. These results demonstrate the feasibility of efficient, real-time gesture recognition on resource-limited edge devices, paving the way for wearable ultrasound-based systems.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
Auctioning Escape Permits for Multiple Correlated Pollutants Using CMRA
Authors:
Keshav Goyal,
Sooraj Sathish,
Shrisha Rao
Abstract:
In the context of increasingly complex environmental challenges, effective pollution control mechanisms are crucial. By extending the state of the art auction mechanisms, we aim to develop an efficient approach for allocating pollution abatement resources in a multi-pollutant setting with pollutants affecting each other's reduction costs. We modify the Combinatorial Multi-Round Ascending Auction f…
▽ More
In the context of increasingly complex environmental challenges, effective pollution control mechanisms are crucial. By extending the state of the art auction mechanisms, we aim to develop an efficient approach for allocating pollution abatement resources in a multi-pollutant setting with pollutants affecting each other's reduction costs. We modify the Combinatorial Multi-Round Ascending Auction for the auction of escape permits of pollutants with co-dependent reduction processes, specifically, greenhouse gas emissions and nutrient runoff in Finnish agriculture. We show the significant advantages of this mechanism in pollution control through experiments on the bid prices and amount of escape permits sold in multiple auction simulations.
△ Less
Submitted 19 August, 2024;
originally announced August 2024.
-
Random Channel Ablation for Robust Hand Gesture Classification with Multimodal Biosignals
Authors:
Keshav Bimbraw,
Jing Liu,
Ye Wang,
Toshiaki Koike-Akino
Abstract:
Biosignal-based hand gesture classification is an important component of effective human-machine interaction. For multimodal biosignal sensing, the modalities often face data loss due to missing channels in the data which can adversely affect the gesture classification performance. To make the classifiers robust to missing channels in the data, this paper proposes using Random Channel Ablation (RC…
▽ More
Biosignal-based hand gesture classification is an important component of effective human-machine interaction. For multimodal biosignal sensing, the modalities often face data loss due to missing channels in the data which can adversely affect the gesture classification performance. To make the classifiers robust to missing channels in the data, this paper proposes using Random Channel Ablation (RChA) during the training process. Ultrasound and force myography (FMG) data were acquired from the forearm for 12 hand gestures over 2 subjects. The resulting multimodal data had 16 total channels, 8 for each modality. The proposed method was applied to convolutional neural network architecture, and compared with baseline, imputation, and oracle methods. Using 5-fold cross-validation for the two subjects, on average, 12.2% and 24.5% improvement was observed for gesture classification with up to 4 and 8 missing channels respectively compared to the baseline. Notably, the proposed method is also robust to an increase in the number of missing channels compared to other methods. These results show the efficacy of using random channel ablation to improve classifier robustness for multimodal and multi-channel biosignal-based hand gesture classification.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM
Authors:
Keshav Bimbraw,
Ye Wang,
Jing Liu,
Toshiaki Koike-Akino
Abstract:
Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their…
▽ More
Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
Authors:
Grant Wilkins,
Srinivasan Keshav,
Richard Mortier
Abstract:
The rapid adoption of large language models (LLMs) has led to significant advances in natural language processing and text generation. However, the energy consumed through LLM model inference remains a major challenge for sustainable AI deployment. To address this problem, we model the workload-dependent energy consumption and runtime of LLM inference tasks on heterogeneous GPU-CPU systems. By con…
▽ More
The rapid adoption of large language models (LLMs) has led to significant advances in natural language processing and text generation. However, the energy consumed through LLM model inference remains a major challenge for sustainable AI deployment. To address this problem, we model the workload-dependent energy consumption and runtime of LLM inference tasks on heterogeneous GPU-CPU systems. By conducting an extensive characterization study of several state-of-the-art LLMs and analyzing their energy and runtime behavior across different magnitudes of input prompts and output text, we develop accurate (R^2>0.96) energy and runtime models for each LLM. We employ these models to explore an offline, energy-optimal LLM workload scheduling framework. Through a case study, we demonstrate the advantages of energy and accuracy aware scheduling compared to existing best practices.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
Authors:
Grant Wilkins,
Srinivasan Keshav,
Richard Mortier
Abstract:
Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity, therefore, raises critical concerns regarding the energy efficiency and sustainability of data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based sche…
▽ More
Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity, therefore, raises critical concerns regarding the energy efficiency and sustainability of data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset, finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a workload-unaware baseline.
△ Less
Submitted 25 April, 2024;
originally announced July 2024.
-
Securing Voice Authentication Applications Against Targeted Data Poisoning
Authors:
Alireza Mohammadi,
Keshav Sood,
Asef Nazari,
Dhananjay Thiruvady
Abstract:
Deep neural network-based voice authentication systems are promising biometric verification techniques that uniquely identify biological characteristics to verify a user. However, they are particularly susceptible to targeted data poisoning attacks, where attackers replace legitimate users' utterances with their own. We propose an enhanced framework using realworld datasets considering realistic a…
▽ More
Deep neural network-based voice authentication systems are promising biometric verification techniques that uniquely identify biological characteristics to verify a user. However, they are particularly susceptible to targeted data poisoning attacks, where attackers replace legitimate users' utterances with their own. We propose an enhanced framework using realworld datasets considering realistic attack scenarios. The results show that the proposed approach is robust, providing accurate authentications even when only a small fraction (5% of the dataset) is poisoned.
△ Less
Submitted 1 October, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Collaborative Framework with Shared Responsibility for Relief Management in Disaster Scenarios
Authors:
Bhupesh Kumar Mishra,
Keshav Dahal
Abstract:
Disasters instances have been increasing both in frequency and intensity causing the tragic loss of life and making life harder for the survivors. Disaster relief management plays a crucial role in enhancing the lifestyle of disaster victims by managing the disaster impacts. Disaster relief management is a process with many collaborative sectors where different stakeholders should operate in all m…
▽ More
Disasters instances have been increasing both in frequency and intensity causing the tragic loss of life and making life harder for the survivors. Disaster relief management plays a crucial role in enhancing the lifestyle of disaster victims by managing the disaster impacts. Disaster relief management is a process with many collaborative sectors where different stakeholders should operate in all major phases of the disaster management progression. In the different phases of the disaster management process, many collaborative government organisations along with nongovernment organisations, leadership, community, and media at different levels need to share the responsibility with disaster victims to achieve effective disaster relief management. Shared responsibility enhances disaster relief management effectiveness and reduces the disaster's impact on the victims. Considering the diverse roles of different stakeholders, there has been a need for a framework that can bind different stakeholders together during disaster management. this paper shows a framework with major stakeholders of disaster relief management and how different stakeholders can take part in an effective disaster relief management process. The framework also highlights how each stakeholder can contribute to relief management at different phases after a disaster. The paper also explores some of the shared responsibility collaborative practices that have been implemented around the world in response to the disaster as a disaster relief management process. In addition, the paper highlights the knowledge obtained from those disaster instances and how this knowledge can be transferred and can be helpful in disaster mitigation and preparedness for future disaster scenarios.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
TAGMol: Target-Aware Gradient-guided Molecule Generation
Authors:
Vineeth Dorna,
D. Subhalingam,
Keshav Kolluru,
Shreshth Tuli,
Mrityunjay Singh,
Saurabh Singal,
N. M. Anoop Krishnan,
Sayan Ranu
Abstract:
3D generative models have shown significant promise in structure-based drug design (SBDD), particularly in discovering ligands tailored to specific target binding sites. Existing algorithms often focus primarily on ligand-target binding, characterized by binding affinity. Moreover, models trained solely on target-ligand distribution may fall short in addressing the broader objectives of drug disco…
▽ More
3D generative models have shown significant promise in structure-based drug design (SBDD), particularly in discovering ligands tailored to specific target binding sites. Existing algorithms often focus primarily on ligand-target binding, characterized by binding affinity. Moreover, models trained solely on target-ligand distribution may fall short in addressing the broader objectives of drug discovery, such as the development of novel ligands with desired properties like drug-likeness, and synthesizability, underscoring the multifaceted nature of the drug design process. To overcome these challenges, we decouple the problem into molecular generation and property prediction. The latter synergistically guides the diffusion sampling process, facilitating guided diffusion and resulting in the creation of meaningful molecules with the desired properties. We call this guided molecular generation process as TAGMol. Through experiments on benchmark datasets, TAGMol demonstrates superior performance compared to state-of-the-art baselines, achieving a 22% improvement in average Vina Score and yielding favorable outcomes in essential auxiliary properties. This establishes TAGMol as a comprehensive framework for drug generation.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
MAGIC: Modular Auto-encoder for Generalisable Model Inversion with Bias Corrections
Authors:
Yihang She,
Clement Atzberger,
Andrew Blake,
Adriano Gualandi,
Srinivasan Keshav
Abstract:
Scientists often model physical processes to understand the natural world and uncover the causation behind observations. Due to unavoidable simplification, discrepancies often arise between model predictions and actual observations, in the form of systematic biases, whose impact varies with model completeness. Classical model inversion methods such as Bayesian inference or regressive neural networ…
▽ More
Scientists often model physical processes to understand the natural world and uncover the causation behind observations. Due to unavoidable simplification, discrepancies often arise between model predictions and actual observations, in the form of systematic biases, whose impact varies with model completeness. Classical model inversion methods such as Bayesian inference or regressive neural networks tend either to overlook biases or make assumptions about their nature during data preprocessing, potentially leading to implausible results. Inspired by recent work in inverse graphics, we replace the decoder stage of a standard autoencoder with a physical model followed by a bias-correction layer. This generalisable approach simultaneously inverts the model and corrects its biases in an end-to-end manner without making strong assumptions about the nature of the biases. We demonstrate the effectiveness of our approach using two physical models from disparate domains: a complex radiative transfer model from remote sensing; and a volcanic deformation model from geodesy. Our method matches or surpasses results from classical approaches without requiring biases to be explicitly filtered out, suggesting an effective pathway for understanding the causation of various physical processes.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification
Authors:
D. Subhalingam,
Keshav Kolluru,
Mausam,
Saurabh Singal
Abstract:
In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address th…
▽ More
In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address this, we introduce GenToC, a model designed for training directly with partially-labeled data, eliminating the necessity for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the associated values for each attribute. GenToC outperforms existing state-of-the-art models, exhibiting upto 56.3% increase in the number of accurate extractions. Furthermore, we utilize GenToC to regenerate the training dataset to expand attribute-value annotations. This bootstrapping substantially improves the data quality for training other standard NER models, which are typically faster but less capable in handling partially-labeled data, enabling them to achieve comparable performance to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially-labeled data and improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, achieving a significant increase of 20.2% in the number of correctly identified attribute-value pairs over the existing deployed system while achieving a high precision of 89.5%.
△ Less
Submitted 18 November, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
An Overview of Intelligent Meta-surfaces for 6G and Beyond: Opportunities, Trends, and Challenges
Authors:
Mayur Katwe,
Aryan Kaushik,
Lina Mohjazi,
Mohammad Abualhayja'a,
Davide Dardari,
Keshav Singh,
Muhammad Ali Imran,
M. Majid Butt,
Octavia A. Dobre
Abstract:
With the impending arrival of the sixth generation (6G) of wireless communication technology, the telecommunications landscape is poised for another revolutionary transformation. At the forefront of this evolution are intelligent meta-surfaces (IS), emerging as a disruptive physical layer technology with the potential to redefine the capabilities and performance metrics of future wireless networks…
▽ More
With the impending arrival of the sixth generation (6G) of wireless communication technology, the telecommunications landscape is poised for another revolutionary transformation. At the forefront of this evolution are intelligent meta-surfaces (IS), emerging as a disruptive physical layer technology with the potential to redefine the capabilities and performance metrics of future wireless networks. As 6G evolves from concept to reality, industry stakeholders, standards organizations, and regulatory bodies are collaborating to define the specifications, protocols, and interoperability standards governing IS deployment. Against this background, this article delves into the ongoing standardization efforts, emerging trends, potential opportunities, and prevailing challenges surrounding the integration of IS into the framework of 6G and beyond networks. Specifically, it provides a tutorial-style overview of recent advancements in IS and explores their potential applications within future networks beyond 6G. Additionally, the article identifies key challenges in the design and implementation of various types of intelligent surfaces, along with considerations for their practical standardization. Finally, it highlights potential future prospects in this evolving field.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
UCINet0: A Machine Learning based Receiver for 5G NR PUCCH Format 0
Authors:
Anil Kumar Yerrapragada,
Jeeva Keshav Sattianarayanin,
Radha Krishna Ganti
Abstract:
Accurate decoding of Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) is essential for enabling 5G wireless links. This paper explores an AI/ML-based receiver design for PUCCH Format 0. Format 0 signaling encodes the UCI content within the phase of a known base waveform and even supports multiplexing of up to 12 users within the same time-frequency resources. Our fir…
▽ More
Accurate decoding of Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) is essential for enabling 5G wireless links. This paper explores an AI/ML-based receiver design for PUCCH Format 0. Format 0 signaling encodes the UCI content within the phase of a known base waveform and even supports multiplexing of up to 12 users within the same time-frequency resources. Our first-of-a-kind neural network classifier, which we term UCINet0, is capable of predicting when no user is transmitting on the PUCCH, as well as decoding the UCI content of any number of multiplexed users, up to 12. Inference results with both simulated and hardware-captured field datasets show that the UCINet0 model outperforms conventional DFT-based decoders across all SNR ranges.
△ Less
Submitted 10 March, 2024;
originally announced April 2024.
-
Biomimicry in Radiation Therapy: Optimizing Patient Scheduling for Improved Treatment Outcomes
Authors:
Keshav Kumar K.,
NVSL Narasimham
Abstract:
In the realm of medical science, the pursuit of enhancing treatment efficacy and patient outcomes continues to drive innovation. This study delves into the integration of biomimicry principles within the domain of Radiation Therapy (RT) to optimize patient scheduling, ultimately aiming to augment treatment results. RT stands as a vital medical technique for eradicating cancer cells and diminishing…
▽ More
In the realm of medical science, the pursuit of enhancing treatment efficacy and patient outcomes continues to drive innovation. This study delves into the integration of biomimicry principles within the domain of Radiation Therapy (RT) to optimize patient scheduling, ultimately aiming to augment treatment results. RT stands as a vital medical technique for eradicating cancer cells and diminishing tumor sizes. Yet, the manual scheduling of patients for RT proves both laborious and intricate. In this research, the focus is on automating patient scheduling for RT through the application of optimization methodologies. Three bio-inspired algorithms are employed for optimization to tackle the complex online stochastic scheduling problem. These algorithms include the Genetic Algorithm (GA), Firefly Optimization (FFO), and Wolf Optimization (WO). These algorithms are harnessed to address the intricate challenges of online stochastic scheduling. Through rigorous evaluation, involving the scrutiny of convergence time, runtime, and objective values, the comparative performance of these algorithms is determined. The results of this study unveil the effectiveness of the applied bio-inspired algorithms in optimizing patient scheduling for RT. Among the algorithms examined, WO emerges as the frontrunner, consistently delivering superior outcomes across various evaluation criteria. The optimization approach showcased in this study holds the potential to streamline processes, reduce manual intervention, and ultimately improve treatment outcomes for patients undergoing RT.
△ Less
Submitted 16 January, 2024;
originally announced April 2024.
-
Global, robust and comparable digital carbon assets
Authors:
Sadiq Jaffer,
Michael Dales,
Patrick Ferris,
Thomas Swinfield,
Derek Sorensen,
Robin Message,
Srinivasan Keshav,
Anil Madhavapeddy
Abstract:
Carbon credits purchased in the voluntary carbon market allow unavoidable emissions, such as from international flights for essential travel, to be offset by an equivalent climate benefit, such as avoiding emissions from tropical deforestation. However, many concerns regarding the credibility of these offsetting claims have been raised. Moreover, the credit market is manual, therefore inefficient…
▽ More
Carbon credits purchased in the voluntary carbon market allow unavoidable emissions, such as from international flights for essential travel, to be offset by an equivalent climate benefit, such as avoiding emissions from tropical deforestation. However, many concerns regarding the credibility of these offsetting claims have been raised. Moreover, the credit market is manual, therefore inefficient and unscalable, and non-fungible, therefore illiquid. To address these issues, we propose an efficient digital methodology that combines remote sensing data, modern econometric techniques, and on-chain certification and trading to create a new digital carbon asset (the PACT stablecoin) against which carbon offsetting claims can be transparently verified. PACT stablecoins are produced as outputs from a reproducible computational pipeline for estimating the climate benefits of carbon offset projects that not only quantifies the CO2 emissions involved, but also allows for similar credits to be pooled based on their co-benefits such as biodiversity and jurisdictional attributes, increasing liquidity through fungibility within pools. We implement and evaluate the PACT carbon stablecoin on the Tezos blockchain, which is designed to facilitate low-cost transactions while minimizing environmental impact. Our implementation includes a contract for a registry for tracking issuance, ownership, and retirement of credits, and a custodian contract to bridge on-chain and off-chain transactions. Our work brings scale and trust to the voluntary carbon market by providing a transparent, scalable, and efficient framework for high integrity carbon credit transactions.
△ Less
Submitted 3 April, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Methods for Matching English Language Addresses
Authors:
Keshav Ramani,
Daniel Borrajo
Abstract:
Addresses occupy a niche location within the landscape of textual data, due to the positional importance carried by every word, and the geographical scope it refers to. The task of matching addresses happens everyday and is present in various fields like mail redirection, entity resolution, etc. Our work defines, and formalizes a framework to generate matching and mismatching pairs of addresses in…
▽ More
Addresses occupy a niche location within the landscape of textual data, due to the positional importance carried by every word, and the geographical scope it refers to. The task of matching addresses happens everyday and is present in various fields like mail redirection, entity resolution, etc. Our work defines, and formalizes a framework to generate matching and mismatching pairs of addresses in the English language, and use it to evaluate various methods to automatically perform address matching. These methods vary widely from distance based approaches to deep learning models. By studying the Precision, Recall and Accuracy metrics of these approaches, we obtain an understanding of the best suited method for this setting of the address matching task.
△ Less
Submitted 14 March, 2024;
originally announced March 2024.
-
Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation
Authors:
Keshav Bhandari,
Simon Colton
Abstract:
Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions. This literature review dissects the evolution of techniques for incorporating coherent structure, from symbolic approaches to foundational and transformative deep learning methods that harness the power of computation and data across a wide variety of training paradig…
▽ More
Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions. This literature review dissects the evolution of techniques for incorporating coherent structure, from symbolic approaches to foundational and transformative deep learning methods that harness the power of computation and data across a wide variety of training paradigms. In the later stages, we review an emerging technique which we refer to as "sub-task decomposition" that involves decomposing music generation into separate high-level structural planning and content creation stages. Such systems incorporate some form of musical knowledge or neuro-symbolic methods by extracting melodic skeletons or structural templates to guide the generation. Progress is evident in capturing motifs and repetitions across all three eras reviewed, yet modelling the nuanced development of themes across extended compositions in the style of human composers remains difficult. We outline several key future directions to realize the synergistic benefits of combining approaches from all eras examined.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.