-
N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion
Authors:
Caleb Chin,
Aashish Khubchandani,
Harshvardhan Maskara,
Kyuseong Choi,
Jacob Feitelberg,
Albert Gong,
Manit Paul,
Tathagata Sadhukhan,
Anish Agarwal,
Raaz Dwivedi
Abstract:
Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
Submitted 4 June, 2025;
originally announced June 2025.
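The basic nearest-neighbor idea behind the methods in the abstract above can be sketched in a few lines. This is a minimal row-wise (user-user) illustration under stated assumptions, not the N$^2$ package's API: the function name `row_nn_estimate`, the mean-squared-distance rule, the hard `radius` threshold, and the toy `ratings` matrix are all hypothetical choices made for the example.

```python
import math


def row_nn_estimate(matrix, i, j, radius):
    """Estimate matrix[i][j] from rows whose observed entries are close to row i.

    Missing entries are stored as None. Returns NaN if no neighbor qualifies.
    """
    values = []
    for u, row in enumerate(matrix):
        # A neighbor must be a different row that observes column j.
        if u == i or row[j] is None:
            continue
        # Squared differences over columns observed in both rows, excluding j.
        diffs = [
            (matrix[i][k] - row[k]) ** 2
            for k in range(len(row))
            if k != j and matrix[i][k] is not None and row[k] is not None
        ]
        # Keep the row only if its mean squared distance is within the radius.
        if diffs and sum(diffs) / len(diffs) <= radius:
            values.append(row[j])
    return sum(values) / len(values) if values else math.nan


# Toy matrix (hypothetical data): entry (0, 2) is missing.
ratings = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
]
estimate = row_nn_estimate(ratings, 0, 2, radius=0.5)
```

With a small radius only the second row counts as a neighbor of the first; a larger radius averages over more rows, which trades variance for bias.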
-
Adaptively-weighted Nearest Neighbors for Matrix Completion
Authors:
Tathagata Sadhukhan,
Manit Paul,
Raaz Dwivedi
Abstract:
In this technical note, we introduce and analyze AWNN: an adaptively weighted nearest neighbor method for performing matrix completion. Nearest neighbor (NN) methods are widely used in missing data problems across multiple disciplines, such as in recommender systems and for performing counterfactual inference in panel data settings. Prior works have shown that, in addition to being very intuitive and easy to implement, NN methods enjoy nice theoretical guarantees. However, the performance of the majority of NN methods relies on the appropriate choice of the radii and the weights assigned to each member of the nearest neighbor set. Despite several works on nearest neighbor methods in the past two decades, there is no systematic approach to choosing the radii and the weights without relying on methods like cross-validation. AWNN addresses this challenge by judiciously balancing the bias-variance trade-off inherent in weighted nearest-neighbor regression. We provide theoretical guarantees for the proposed method under minimal assumptions and support the theory via synthetic experiments.
Submitted 14 May, 2025;
originally announced May 2025.
-
RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Authors:
Chirag Parikh,
Deepti Rawat,
Rakshitha R. T.,
Tathagata Ghosh,
Ravi Kiran Sarvadevabhatla
Abstract:
We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.
Submitted 27 March, 2025;
originally announced March 2025.
-
Pushing the Boundary of Quantum Advantage in Hard Combinatorial Optimization with Probabilistic Computers
Authors:
Shuvro Chowdhury,
Navid Anjum Aadit,
Andrea Grimaldi,
Eleonora Raimondo,
Atharva Raut,
P. Aaron Lott,
Johan H. Mentink,
Marek M. Rams,
Federico Ricci-Tersenghi,
Massimo Chiappini,
Luke S. Theogarajan,
Tathagata Srimani,
Giovanni Finocchio,
Masoud Mohseni,
Kerem Y. Camsari
Abstract:
Recent demonstrations on specialized benchmarks have reignited excitement for quantum computers, yet whether they can deliver an advantage for practical real-world problems remains an open question. Here, we show that probabilistic computers (p-computers) when co-designed with hardware to implement powerful Monte Carlo algorithms can surpass state-of-the-art quantum annealers [King et al., Nature (2023), https://www.nature.com/articles/s41586-023-05867-2] in solving certain hard optimization problems. We focus on two key algorithms: discrete-time simulated quantum annealing (DT-SQA) and adaptive parallel tempering (APT), both applied to 3D spin glasses. For DT-SQA, we find that increasing the number of replicas improves residual energy scaling, while parallelizing fewer replicas across independent runs also achieves comparable scaling. Both strategies align with the theoretical expectations from extreme value theory. In addition, APT outperforms DT-SQA when supported by non-local isoenergetic cluster moves. Finite-size scaling analysis suggests a universal behavior that explains the superior performance of APT over both DT-SQA and quantum annealing. We show that these algorithms are readily implementable in modern hardware thanks to the mature semiconductor technology. Unlike software simulations, replicas can be monolithically housed on a single chip and a large number of spins can be updated in parallel and asynchronously, similar to a quantum annealer. We project that custom Field Programmable Gate Arrays (FPGA) or specialized chips leveraging massive parallelism can further accelerate these algorithms by orders of magnitude, while drastically improving energy efficiency. Our results raise the bar for a practical quantum advantage in optimization and present p-computers as scalable, energy-efficient hardware for real-world optimization problems.
Submitted 7 April, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
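The replica-exchange step that underlies parallel tempering, named in the abstract above, can be sketched as follows. This illustrates only the standard swap-acceptance rule; it does not reproduce the paper's adaptive temperature selection or its isoenergetic cluster moves. The function name `swap_probability` and the example numbers are hypothetical.

```python
import math


def swap_probability(beta_a, energy_a, beta_b, energy_b):
    """Metropolis acceptance probability for exchanging two replicas.

    beta_* are inverse temperatures and energy_* are the current energies
    of the two replicas. The standard rule is
    min(1, exp((beta_a - beta_b) * (energy_a - energy_b))).
    """
    exponent = (beta_a - beta_b) * (energy_a - energy_b)
    # Handle non-negative exponents first to avoid overflow in exp().
    return 1.0 if exponent >= 0 else math.exp(exponent)


# A cold replica (beta = 2.0) sitting at a higher energy than a hot one
# (beta = 1.0): the swap is always accepted.
p_favorable = swap_probability(2.0, -1.0, 1.0, -3.0)

# The reverse situation: the swap is accepted only with probability exp(-2).
p_unfavorable = swap_probability(2.0, -3.0, 1.0, -1.0)
```

Swaps of this kind let low-energy configurations found at high temperature migrate to the cold replicas, which is what allows the method to escape local minima.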
-
Scalable Connectivity for Ising Machines: Dense to Sparse
Authors:
M Mahmudul Hasan Sajeeb,
Navid Anjum Aadit,
Shuvro Chowdhury,
Tong Wu,
Cesely Smith,
Dhruv Chinmay,
Atharva Raut,
Kerem Y. Camsari,
Corentin Delacour,
Tathagata Srimani
Abstract:
In recent years, hardware implementations of Ising machines have emerged as a viable alternative to quantum computing for solving hard optimization problems among other applications. Unlike quantum hardware, dense connectivity can be achieved in classical systems. However, we show that dense connectivity leads to severe frequency slowdowns and interconnect congestion scaling unfavorably with system sizes. As a scalable solution, we propose a systematic sparsification method for dense graphs by introducing copy nodes to limit the number of neighbors per graph node. In addition to solving interconnect congestion, this approach enables constant frequency scaling where all spins in a network can be updated in constant time. On the other hand, sparsification introduces new difficulties, such as constraint-breaking between copied spins and increased convergence times to solve optimization problems, especially if exact ground states are sought. Relaxing the exact solution requirements, we find that the overheads in convergence times are milder. We demonstrate these ideas by designing probabilistic bit Ising machines using ASAP7 (a predictive 7nm FinFET technology model) process design kits as well as Field Programmable Gate Array (FPGA)-based implementations. Finally, we show how formulating problems in naturally sparse networks (e.g., by invertible logic) sidesteps challenges introduced by sparsification methods. Our results are applicable to a broad family of Ising machines using different hardware implementations.
Submitted 2 June, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Bridging Language Barriers in Healthcare: A Study on Arabic LLMs
Authors:
Nada Saadi,
Tathagata Raha,
Clément Christophe,
Marco AF Pimentel,
Ronnie Rajan,
Praveen K Kanithi
Abstract:
This paper investigates the challenges of developing large language models (LLMs) proficient in both multilingual understanding and medical knowledge. We demonstrate that simply translating medical data does not guarantee strong performance on clinical tasks in the target language. Our experiments reveal that the optimal language mix in training data varies significantly across different medical tasks. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks. Furthermore, our results suggest that relying solely on fine-tuning may not be the most effective approach for incorporating new language knowledge into LLMs. Instead, data and computationally intensive pretraining methods may still be necessary to achieve optimal performance in multilingual medical settings. These findings provide valuable guidance for building effective and inclusive medical AI systems for diverse linguistic communities.
Submitted 16 January, 2025;
originally announced January 2025.
-
Tube Loss: A Novel Approach for Prediction Interval Estimation and probabilistic forecasting
Authors:
Pritam Anand,
Tathagata Bandyopadhyay,
Suresh Chandra
Abstract:
This paper proposes a novel loss function, called 'Tube Loss', for simultaneous estimation of bounds of a Prediction Interval (PI) in the regression setup. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by the existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level $t \in (0,1)$ asymptotically. A theoretical proof of this fact is given. Secondly, the user is allowed to move the interval up or down by controlling the value of a parameter. This helps the user to choose a PI capturing denser regions of the probability distribution of the response variable inside the interval, thus sharpening its width. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss-based PI estimation method can trade off coverage against average width by solving a single optimization problem. It enables further reduction of the average width of the PI through re-calibration. Also, unlike a few existing PI estimation methods, the gradient descent (GD) method can be used for minimization of empirical risk. Through extensive experiments, we demonstrate the effectiveness of Tube Loss-based PI estimation in both kernel machines and neural networks. Additionally, we show that Tube Loss-based deep probabilistic forecasting models achieve superior performance compared to existing probabilistic forecasting techniques across several benchmark and wind datasets. Finally, we empirically validate the advantages of the Tube Loss approach within the conformal prediction framework. Code is available at https://github.com/ltpritamanand/Tube_loss.
Submitted 17 May, 2025; v1 submitted 8 December, 2024;
originally announced December 2024.
-
On adaptivity and minimax optimality of two-sided nearest neighbors
Authors:
Tathagata Sadhukhan,
Manit Paul,
Raaz Dwivedi
Abstract:
Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a Hölder function class that is less smooth than Lipschitz. Our results establish the following favorable properties for a suitable two-sided NN: (1) the mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps.
Submitted 19 November, 2024;
originally announced November 2024.
-
Massimo: Public Queue Monitoring and Management using Mass-Spring Model
Authors:
Abhijeet Kumar,
Unnati Singh,
Rajdeep Chatterjee,
Tathagata Bandyopadhyay
Abstract:
An efficient system for queue control and regulation in public spaces is very important for avoiding traffic jams and improving customer satisfaction. This article offers a detailed road map, based on a combination of intelligent systems, for creating efficient queueing systems in public places. By using technologies such as computer vision, machine learning algorithms, and deep learning, our system provides accurate information about whether a place is crowded and what measures need to be taken.
Submitted 21 October, 2024;
originally announced October 2024.
-
Skill vs. Chance Quantification for Popular Card & Board Games
Authors:
Tathagata Banerjee,
Anushka De,
Subhamoy Maitra,
Diganta Mukherjee
Abstract:
This paper presents a data-driven statistical framework to quantify the role of skill in games, addressing the long-standing question of whether success in a game is predominantly driven by skill or chance. We analyze player-level data from four popular games (Chess, Rummy, Ludo, and Teen Patti), using empirical win statistics across varying levels of experience. By modeling win rate as a function of experience through a regression framework and employing empirical bootstrap resampling, we estimate the degree to which outcomes improve with repeated play. To summarize these dynamics, we propose a flexible skill score that emphasizes learning over initial performance, aligning with practical and regulatory interpretations of skill. Our results reveal a clear ranking, with Chess showing the highest skill component and Teen Patti the lowest, while Rummy and Ludo fall in between. The proposed framework is transparent, reproducible, and adaptable to other game formats and outcome metrics, offering potential applications in legal classification, game design, and player performance analysis.
Submitted 27 May, 2025; v1 submitted 18 October, 2024;
originally announced October 2024.
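The combination of regression and bootstrap resampling described in the abstract above can be illustrated with a small sketch. It assumes a simple linear regression of win rate on experience, which the abstract does not specify; the function names `ols_slope` and `bootstrap_slopes` and the toy data are hypothetical, and the paper's actual skill score is not reproduced here.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slopes(xs, ys, n_boot, seed=0):
    """Slopes refitted on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [xs[k] for k in idx]
        # The slope is undefined if every resampled x is identical; redraw.
        if len(set(boot_x)) < 2:
            continue
        slopes.append(ols_slope(boot_x, [ys[k] for k in idx]))
    return slopes


# Toy data (hypothetical): win rate rises by 0.002 per game played.
experience = [10, 20, 30, 40, 50]
win_rate = [0.50, 0.52, 0.54, 0.56, 0.58]

slope = ols_slope(experience, win_rate)
slopes = bootstrap_slopes(experience, win_rate, n_boot=200)
```

A positive slope whose bootstrap distribution stays away from zero would suggest that outcomes improve with repeated play; on real, noisy data the spread of `slopes` gives a rough measure of that uncertainty.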
-
Named Clinical Entity Recognition Benchmark
Authors:
Wadood M Abdul,
Marco AF Pimentel,
Muhammad Umar Salman,
Tathagata Raha,
Clément Christophe,
Praveen K Kanithi,
Nasir Hayat,
Ronnie Rajan,
Shadab Khan
Abstract:
This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support.
The leaderboard provides a standardized platform for assessing diverse language models, including encoder and decoder architectures, on their ability to identify and classify clinical entities across multiple medical domains. A curated collection of openly available clinical datasets is utilized, encompassing entities such as diseases, symptoms, medications, procedures, and laboratory measurements. Importantly, these entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across different healthcare systems and datasets, and a comprehensive evaluation of model performance. Performance of models is primarily assessed using the F1-score, and it is complemented by various assessment modes to provide comprehensive insights into model performance. The report also includes a brief analysis of models evaluated to date, highlighting observed trends and limitations.
By establishing this benchmarking framework, the leaderboard aims to promote transparency, facilitate comparative analyses, and drive innovation in clinical entity recognition tasks, addressing the need for robust evaluation methods in healthcare NLP.
Submitted 7 October, 2024;
originally announced October 2024.
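The F1-score used as the primary metric in the abstract above can be computed as in the sketch below. It assumes exact-match scoring over (start, end, label) tuples, which is only one possible assessment mode; the function name `entity_f1`, the span offsets, and the labels are hypothetical.

```python
def entity_f1(predicted, gold):
    """Exact-match F1 between predicted and gold entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # With no correct predictions, precision or recall is zero (or undefined).
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Each entity is a (start, end, label) tuple; values are illustrative only.
gold_entities = {(0, 2, "disease"), (5, 6, "medication"), (9, 10, "procedure")}
predicted_entities = {(0, 2, "disease"), (5, 6, "symptom")}

score = entity_f1(predicted_entities, gold_entities)
```

Here one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), so the harmonic mean is 0.4. A span with the right boundaries but the wrong label counts as an error under exact matching.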
-
Omni 3D: BEOL-Compatible 3D Logic with Omnipresent Power, Signal, and Clock
Authors:
Suhyeong Choi,
Carlo Gilardi,
Paul Gutwin,
Robert M. Radway,
Tathagata Srimani,
Subhasish Mitra
Abstract:
This paper presents Omni 3D - a 3D-stacked device architecture that is naturally enabled by back-end-of-line (BEOL)-compatible transistors. Omni 3D arbitrarily interleaves metal layers for both signal/power with FETs in 3D (i.e., nFETs and pFETs are stacked in 3D). Thus, signal/power routing layers have fine-grained, all-sided access to the FET active regions maximizing 3D standard cell design flexibility. This is in sharp contrast to approaches such as back-side power delivery networks (BSPDNs), complementary FETs (CFETs), and stacked FETs. Importantly, the routing flexibility of Omni 3D is enabled by double-side routing and an interleaved metal (IM) layer for inter- and intra-cell routing, respectively. In this work, we explore Omni 3D variants (e.g., both with and without the IM layer) and optimize these variants using a virtual-source BEOL-FET compact model. We establish a physical design flow that efficiently utilizes the double-side routing in Omni 3D and perform a thorough design-technology-co-optimization (DTCO) of Omni 3D device architecture on several design points. From our design flow, we project 2.0x improvement in the energy-delay product and 1.5x reduction in area compared to the state-of-the-art CFETs with BSPDNs.
Submitted 25 September, 2024;
originally announced September 2024.
-
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs
Authors:
Clément Christophe,
Tathagata Raha,
Svetlana Maslenkova,
Muhammad Umar Salman,
Praveen K Kanithi,
Marco AF Pimentel,
Shadab Khan
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.
Submitted 23 September, 2024;
originally announced September 2024.
-
Next-generation Probabilistic Computing Hardware with 3D MOSAICs, Illusion Scale-up, and Co-design
Authors:
Tathagata Srimani,
Robert Radway,
Masoud Mohseni,
Kerem Çamsarı,
Subhasish Mitra
Abstract:
The vast majority of 21st century AI workloads are based on gradient-based deterministic algorithms such as backpropagation. One of the key reasons for the dominance of deterministic ML algorithms is the emergence of powerful hardware accelerators (GPU and TPU) that have enabled the wide-scale adoption and implementation of these algorithms. Meanwhile, discrete and probabilistic Monte Carlo algorithms have long been recognized as one of the most successful algorithms in all of computing with a wide range of applications. Specifically, Markov Chain Monte Carlo (MCMC) algorithm families have emerged as the most widely used and effective method for discrete combinatorial optimization and probabilistic sampling problems. We adopt a hardware-centric perspective on probabilistic computing, outlining the challenges and potential future directions to advance this field. We identify two critical research areas: 3D integration using MOSAICs (Monolithic/Stacked/Assembled ICs) and the concept of Illusion, a hardware-agnostic distributed computing framework designed to scale probabilistic accelerators.
Submitted 11 September, 2024;
originally announced September 2024.
-
Overcoming Ambient Drift and Negative-Bias Temperature Instability in Foundry Carbon Nanotube Transistors
Authors:
Andrew Yu,
Tathagata Srimani,
Max Shulaker
Abstract:
Back-end-of-line (BEOL) logic integration is emerging as a complementary scaling path to supplement front-end-of-line (FEOL) Silicon. Among various options for BEOL logic, Carbon Nanotube Field-Effect Transistors (CNFETs) have been integrated within commercial silicon foundries, and complex CNFET circuits (e.g., RISC-V core, SRAM arrays) have been demonstrated. However, comprehensive studies that analyze the ambient drift (i.e., air-stability) and reliability of CNFETs are lacking. Here, for the first time, we thoroughly characterize and demonstrate how to overcome ambient drift and negative bias temperature instability (NBTI) in CNFETs using the following techniques: (1) Silicon Nitride encapsulation to limit ambient atmosphere induced threshold voltage shift (~8x reduction of median VT shift over 90 days) and (2) AC/pulsed operation to significantly improve CNFET NBTI vs. DC operation across a wide frequency range (e.g., 20% duty cycle AC operation at 10 MHz could extend CNFET NBTI time-to-failure by >10000x vs. DC for a target VT shift tolerance < 100 mV with gate stress bias VGS,stress = -1.2 V at 125 °C).
Submitted 14 February, 2025; v1 submitted 17 September, 2024;
originally announced September 2024.
-
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Authors:
Praveen K Kanithi,
Clément Christophe,
Marco AF Pimentel,
Tathagata Raha,
Nada Saadi,
Hamza Javed,
Svetlana Maslenkova,
Nasir Hayat,
Ronnie Rajan,
Shadab Khan
Abstract:
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
Submitted 11 September, 2024;
originally announced September 2024.
-
Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
Authors:
Tathagata Bandyopadhyay
Abstract:
Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and jointly train both the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by $3.12$ dB points. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by $4.1$ dB points on average without creating additional data dependency.
Submitted 2 September, 2024;
originally announced September 2024.
-
Skill Dominance Analysis of Two(Four) player, Three(Five) dice Variant of the Ludo Game
Authors:
Tathagata Banerjee,
Diganta Mukherjee
Abstract:
This paper examines two different variants of the Ludo game, involving multiple dice and a fixed number of total turns. Within each variant, multiple game lengths (total no. of turns) are considered. To compare the two variants, a set of intuitive, rule-based strategies is designed, representing different broad methods of strategic play. Game play is simulated between bots (automated software applications executing repetitive tasks over a network) following these strategies. The expected results are computed using certain game theoretic and probabilistic explanations, helping to understand the performance of the different strategies. The different strategies are further analyzed using win percentage in a large number of simulations, and Nash Equilibrium strategies are computed for both variants for a varying number of total turns. The Nash Equilibrium strategies across different game lengths are compared. A clear distinction between performances of strategies is observed, with more sophisticated strategies beating the naive one. A gradual shift in optimal strategy profiles is observed with changing game length, and certain sophisticated strategies even confound each other's performance while playing against each other.
Submitted 11 November, 2024; v1 submitted 31 August, 2024;
originally announced September 2024.
-
Med42-v2: A Suite of Clinical LLMs
Authors:
Clément Christophe,
Praveen K Kanithi,
Tathagata Raha,
Shadab Khan,
Marco AF Pimentel
Abstract:
Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address the limitations of generic models in healthcare settings. These models are built on the Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to effectively respond to natural prompts. While generic models are often preference-aligned to avoid answering clinical queries as a precaution, Med42-v2 is specifically trained to overcome this limitation, enabling its use in clinical settings. Med42-v2 models outperform the original Llama3 models in both the 8B and 70B parameter configurations, as well as GPT-4, across various medical benchmarks. These LLMs are developed to understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments. The models are now publicly available at https://huggingface.co/m42-health.
Submitted 12 August, 2024;
originally announced August 2024.
-
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks
Authors:
Marco AF Pimentel,
Clément Christophe,
Tathagata Raha,
Prateek Munjal,
Praveen K Kanithi,
Shadab Khan
Abstract:
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of various linguistic tasks, model architectures, and benchmarking methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks for assessing the capabilities of LLMs across diverse domains. This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
Submitted 28 July, 2024;
originally announced July 2024.
-
Investigating Annotator Bias in Large Language Models for Hate Speech Detection
Authors:
Amit Das,
Zheng Zhang,
Najib Hasan,
Souvika Sarkar,
Fatemeh Jamshidi,
Tathagata Bhattacharya,
Mostafa Rahgouy,
Nilanjana Raychawdhary,
Dongji Feng,
Vinija Jain,
Aman Chadha,
Mary Sandage,
Lauramarie Pope,
Gerry Dozier,
Cheryl Seals
Abstract:
Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs) presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs as annotators, this paper delves into the biases present in LLMs when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability, with four LLMs: GPT-3.5, GPT-4o, Llama-3.1, and Gemma-2. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateBiasNet, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al. 2022) dataset for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for data annotation, thereby fostering advancements in this critical field.
Submitted 16 November, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Facts-and-Feelings: Capturing both Objectivity and Subjectivity in Table-to-Text Generation
Authors:
Tathagata Dey,
Pushpak Bhattacharyya
Abstract:
Table-to-text generation, a long-standing challenge in natural language generation, has remained unexplored through the lens of subjectivity. Subjectivity here encompasses the comprehension of information derived from the table that cannot be described solely by objective data. Given the absence of pre-existing datasets, we introduce the Ta2TS dataset with 3849 data instances. We fine-tune sequence-to-sequence models on the linearized tables and prompt popular large language models. We analyze the results from a quantitative and qualitative perspective to ensure the capture of subjectivity and factual consistency. The analysis shows the fine-tuned LMs can perform close to the prompted LLMs. Both models can capture the tabular data, generating texts with 85.15% BERTScore and 26.28% METEOR score. To the best of our knowledge, we provide the first-of-its-kind dataset on tables with multiple genres and subjectivity included and present the first comprehensive analysis and comparison of different LLM performances on this task.
Submitted 15 June, 2024;
originally announced June 2024.
-
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers
Authors:
Harshit Gupta,
Manav Chaudhary,
Tathagata Raha,
Shivansh Subramanian,
Vasudeva Varma
Abstract:
This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default common-sense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM's reasoning capabilities to improve performance. Our approach demonstrated significant improvements, showing that it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies.
Submitted 25 May, 2024;
originally announced May 2024.
-
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches
Authors:
Clément Christophe,
Praveen K Kanithi,
Prateek Munjal,
Tathagata Raha,
Nasir Hayat,
Ronnie Rajan,
Ahmed Al-Mahrooqi,
Avani Gupta,
Muhammad Umar Salman,
Gurpreet Gosal,
Bhargav Kanakiya,
Charles Chen,
Natalia Vassilieva,
Boulbaba Ben Amor,
Marco AF Pimentel,
Shadab Khan
Abstract:
This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks. Notably, our medical LLM Med42 showed an accuracy level of 72% on the US Medical Licensing Examination (USMLE) datasets, setting a new standard in performance for openly available medical LLMs. Through this comparative analysis, we aim to identify the most effective and efficient method for fine-tuning LLMs in the medical domain, thereby contributing significantly to the advancement of AI-driven healthcare applications.
Submitted 23 April, 2024;
originally announced April 2024.
-
OffensiveLang: A Community Based Implicit Offensive Language Dataset
Authors:
Amit Das,
Mostafa Rahgouy,
Dongji Feng,
Zheng Zhang,
Tathagata Bhattacharya,
Nilanjana Raychawdhary,
Fatemeh Jamshidi,
Vinija Jain,
Aman Chadha,
Mary Sandage,
Lauramarie Pope,
Gerry Dozier,
Cheryl Seals
Abstract:
The widespread presence of hateful language on social media has resulted in adverse effects on societal well-being. As a result, addressing this issue with high priority has become very important. Hate speech or offensive language exists in both explicit and implicit forms, with the latter being more challenging to detect. Current research in this domain encounters several challenges. Firstly, the existing datasets primarily rely on the collection of texts containing explicit offensive keywords, making it challenging to capture implicitly offensive content that is devoid of these keywords. Secondly, common methodologies tend to focus solely on textual analysis, neglecting the valuable insights that community information can provide. In this research paper, we introduce a novel dataset, OffensiveLang, a community-based implicit offensive language dataset generated by ChatGPT 3.5 containing data for 38 different target groups. Despite limitations in generating offensive texts using ChatGPT due to ethical constraints, we present a prompt-based approach that effectively generates implicit offensive language. To ensure data quality, we evaluate the dataset with human annotators. Additionally, we employ a prompt-based zero-shot method with ChatGPT and compare the detection results between human annotation and ChatGPT annotation. We utilize existing state-of-the-art models to see how effective they are in detecting such language. The dataset is available here: https://github.com/AmitDasRup123/OffensiveLang
Submitted 14 December, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with Decentralized Natural Language Understanding Models
Authors:
Burak Aksar,
Yara Rizk,
Tathagata Chakraborti
Abstract:
Chatbots have become one of the main pathways for the delivery of business automation tools. Multi-agent systems offer a framework for designing chatbots at scale, making it easier to support complex conversations that span across multiple domains as well as enabling developers to maintain and expand their capabilities incrementally over time. However, multi-agent systems complicate the natural language understanding (NLU) of user intents, especially when they rely on decentralized NLU models: some utterances (termed single intent) may invoke a single agent while others (termed multi-intent) may explicitly invoke multiple agents. Without correctly parsing multi-intent inputs, decentralized NLU approaches will not achieve high prediction accuracy. In this paper, we propose an efficient parsing and orchestration pipeline algorithm to service multi-intent utterances from the user in the context of a multi-agent system. Our proposed approach achieved comparable performance to competitive deep learning models on three different datasets while being up to 48 times faster.
Submitted 18 December, 2023;
originally announced December 2023.
-
Enabling Robots to Identify Missing Steps in Robot Tasks for Guided Learning from Demonstration
Authors:
Maximilian Diehl,
Tathagata Chakraborti,
Karinne Ramirez-Amaro
Abstract:
Learning from Demonstration (LfD) systems are commonly used to teach robots new tasks by generating a set of skills from user-provided demonstrations. These skills can then be sequenced by planning algorithms to execute complex tasks. However, LfD systems typically require a full demonstration of the entire task, even when parts of it are already known to the robot. This limitation comes from the system's inability to recognize which sub-tasks are already familiar, leading to a repetitive and burdensome demonstration process for users. In this paper, we introduce a new method for guided demonstrations that reduces this burden, by helping the robot to identify which parts of the task it already knows, considering the overall task goal and the robot's existing skills. In particular, through a combinatorial search, the method finds the smallest necessary change in the initial task conditions that allows the robot to solve the task with its current knowledge. This state is referred to as the excuse state. The human demonstrator is then only required to teach how to reach the excuse state (missing sub-task), rather than demonstrating the entire task. Empirical results and a pilot user study show that our method reduces demonstration time by 61% and decreases the size of demonstrations by 72%.
Submitted 11 December, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning
Authors:
Turgay Caglar,
Sirine Belhaj,
Tathagata Chakraborti,
Michael Katz,
Sarath Sreedharan
Abstract:
This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this union, we explore two different flavors of model space problems that have been studied in the AI planning literature and explore the effect of an LLM on those tasks. We empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) -- an approach that has been traditionally used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical signal in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future.
Submitted 4 March, 2024; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Development of Machine Vision Approach for Mechanical Component Identification based on its Dimension and Pitch
Authors:
Toshit Jain,
Faisel Mushtaq,
K Ramesh,
Sandip Deshmukh,
Tathagata Ray,
Chandu Parimi,
Praveen Tandon,
Pramod Kumar Jha
Abstract:
In this work, a highly customizable and scalable vision-based system for automation of mechanical assembly lines is described. The proposed system calculates the features that are required to classify and identify the different kinds of bolts that are used in the assembly line. The system describes a novel method of calculating the pitch of the bolt in addition to bolt identification and calculating the dimensions of the bolts. This identification and classification system is extremely lightweight and can be run on bare minimum hardware. The system is very fast, running on the order of milliseconds, so it can be used successfully even if the components are steadily moving on a conveyor. The results show that our system can correctly identify the parts in our dataset with 98% accuracy using the calculated features.
Submitted 3 October, 2023;
originally announced October 2023.
-
TOBY: A Tool for Exploring Data in Academic Survey Papers
Authors:
Tathagata Chakraborti,
Jungkoo Kang,
Christian Muise,
Sarath Sreedharan,
Michael Walker,
Daniel Szafir,
Tom Williams
Abstract:
This paper describes TOBY, a visualization tool that helps a user explore the contents of an academic survey paper. The visualization consists of four components: a hierarchical view of taxonomic data in the survey, a document similarity view in the space of taxonomic classes, a network view of citations, and a new paper recommendation tool. In this paper, we will discuss these features in the context of three separate deployments of the tool.
Submitted 12 June, 2023;
originally announced June 2023.
-
Neural models for Factual Inconsistency Classification with Explanations
Authors:
Tathagata Raha,
Mukund Choudhary,
Abhinav Menon,
Harshit Gupta,
KV Aditya Srivatsa,
Manish Gupta,
Vasudeva Varma
Abstract:
Factual consistency is one of the most important requirements when editing high quality documents. It is extremely important for automatic text generation systems like summarization, question answering, dialog modeling, and language modeling. Still, automated factual inconsistency detection is rather under-studied. Existing work has focused on (a) finding fake news keeping a knowledge base in context, or (b) detecting broad contradiction (as part of natural language inference literature). However, there has been no work on detecting and explaining types of factual inconsistencies in text, without any knowledge base in context. In this paper, we leverage existing work in linguistics to formally define five types of factual inconsistencies. Based on this categorization, we contribute a novel dataset, FICLE (Factual Inconsistency CLassification with Explanation), with ~8K samples where each sample consists of two sentences (claim and context) annotated with type and span of inconsistency. When the inconsistency relates to an entity type, it is labeled as well at two levels (coarse and fine-grained). Further, we leverage this dataset to train a pipeline of four neural models to predict inconsistency type with explanations, given a (claim, context) sentence pair. Explanations include inconsistent claim fact triple, inconsistent context span, inconsistent claim component, coarse and fine-grained inconsistent entity types. The proposed system first predicts inconsistent spans from claim and context; and then uses them to predict inconsistency types and inconsistent entity types (when inconsistency is due to entities). We experiment with multiple Transformer-based natural language classification as well as generative models, and find that DeBERTa performs the best. Our proposed methods provide a weighted F1 of ~87% for inconsistency type classification across the five classes.
Submitted 15 June, 2023;
originally announced June 2023.
-
MACQ: A Holistic View of Model Acquisition Techniques
Authors:
Ethan Callanan,
Rebecca De Venezia,
Victoria Armstrong,
Alison Paredes,
Tathagata Chakraborti,
Christian Muise
Abstract:
For over three decades, the planning community has explored countless methods for data-driven model acquisition. These range in sophistication (e.g., simple set operations to full-blown reformulations), methodology (e.g., logic-based vs. planning-based), and assumptions (e.g., fully vs. partially observable). With no fewer than 43 publications in the space, it can be overwhelming to understand what approach could or should be applied in a new setting. We present a holistic characterization of the action model acquisition space and further introduce a unifying framework for automated action model acquisition. We have re-implemented some of the landmark approaches in the area, and our characterization of all the techniques offers deep insight into the research opportunities that remain; i.e., those settings that no existing technique is capable of solving.
Submitted 13 June, 2022;
originally announced June 2022.
-
Semi-Supervised Cascaded Clustering for Classification of Noisy Label Data
Authors:
Ashit Gupta,
Anirudh Deodhar,
Tathagata Mukherjee,
Venkataramana Runkana
Abstract:
The performance of supervised classification techniques often deteriorates when the data has noisy labels. Even the semi-supervised classification approaches have largely focused only on the problem of handling missing labels. Most of the approaches addressing noisy label data rely on deep neural networks (DNN) that require huge datasets for classification tasks. This poses a serious challenge, especially in process and manufacturing industries, where the data is limited and labels are noisy. We propose a semi-supervised cascaded clustering (SSCC) algorithm to extract patterns and generate a cascaded tree of classes in such datasets. A novel cluster evaluation matrix (CEM) with configurable hyperparameters is introduced to localize and eliminate the noisy labels and invoke a pruning criterion on cascaded clustering. The algorithm reduces the dependency on expensive human expertise for assessing the accuracy of labels. A classifier generated based on SSCC is found to be accurate and consistent even when trained on noisy label datasets. It outperformed support vector machines (SVM) when tested on multiple noisy-label datasets, including an industrial dataset. The proposed approach can be effectively used for deriving actionable insights in industrial settings with minimal human expertise.
Submitted 4 May, 2022;
originally announced May 2022.
-
Virtual, Augmented, and Mixed Reality for Human-Robot Interaction: A Survey and Virtual Design Element Taxonomy
Authors:
Michael Walker,
Thao Phung,
Tathagata Chakraborti,
Tom Williams,
Daniel Szafir
Abstract:
Virtual, Augmented, and Mixed Reality for Human-Robot Interaction (VAM-HRI) has been gaining considerable attention in research in recent years. However, the HRI community lacks shared terminology and a common framework for characterizing aspects of mixed reality interfaces, presenting serious problems for future research. Therefore, it is important to have a common set of terms and concepts that can be used to precisely describe and organize the diverse array of work being done within the field. In this paper, we present a novel taxonomic framework for different types of VAM-HRI interfaces, composed of four main categories of virtual design elements (VDEs). We present and justify our taxonomy and explain how its elements have been developed over the last 30 years, as well as the directions in which VAM-HRI is headed in the coming decade.
Submitted 22 February, 2022;
originally announced February 2022.
-
COVID-19 India Dataset: Parsing COVID-19 Data in Daily Health Bulletins from States in India
Authors:
Mayank Agarwal,
Tathagata Chakraborti,
Sachin Grover,
Arunima Chaudhary
Abstract:
While India has been one of the hotspots of COVID-19, data about the pandemic from the country has proved to be largely inaccessible at scale. Much of the data exists in unstructured form on the web, and limited aspects of such data are available through public APIs maintained manually through volunteer effort. This has proved to be difficult both in terms of ease of access to detailed data and with regard to the maintenance of manual data-keeping over time. This paper reports on our effort at automating the extraction of such data from public health bulletins with the help of a combination of classical PDF parsers and state-of-the-art machine learning techniques. In this paper, we will describe the automated data-extraction technique, the nature of the generated data, and exciting avenues of ongoing work.
Submitted 6 December, 2021; v1 submitted 27 September, 2021;
originally announced October 2021.
-
Predicting Mood Disorder Symptoms with Remotely Collected Videos Using an Interpretable Multimodal Dynamic Attention Fusion Network
Authors:
Tathagata Banerjee,
Matthew Kollada,
Pablo Gersberg,
Oscar Rodriguez,
Jane Tiller,
Andrew E Jaffe,
John Reynders
Abstract:
We developed a novel, interpretable multimodal classification method to identify symptoms of mood disorders viz. depression, anxiety and anhedonia using audio, video and text collected from a smartphone application. We used CNN-based unimodal encoders to learn dynamic embeddings for each modality and then combined these through a transformer encoder. We applied these methods to a novel dataset - collected by a smartphone application - on 3002 participants across up to three recording sessions. Our method demonstrated better multimodal classification performance compared to existing methods that employed static embeddings. Lastly, we used SHapley Additive exPlanations (SHAP) to prioritize important features in our model that could serve as potential digital markers.
Submitted 7 September, 2021;
originally announced September 2021.
-
NeurIPS 2020 NLC2CMD Competition: Translating Natural Language to Bash Commands
Authors:
Mayank Agarwal,
Tathagata Chakraborti,
Quchen Fu,
David Gros,
Xi Victoria Lin,
Jaron Maene,
Kartik Talamadupula,
Zhongwei Teng,
Jules White
Abstract:
The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax. This is a report on the competition with details of the task, metrics, data, attempted solutions, and lessons learned.
Submitted 8 August, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Identifying COVID-19 Fake News in Social Media
Authors:
Tathagata Raha,
Vijayasaradhi Indurthi,
Aayush Upadhyaya,
Jeevesh Kataria,
Pramud Bommakanti,
Vikram Keswani,
Vasudeva Varma
Abstract:
The evolution of social media platforms has empowered everyone to access information easily. Social media users can easily share information with the rest of the world. This may sometimes encourage the spread of fake news, which can result in undesirable consequences. In this work, we train models which can identify health news related to the COVID-19 pandemic as real or fake. Our models achieve a high F1-score of 98.64%. Our models achieve second place on the leaderboard, trailing the first position by a very narrow margin of 0.05% points.
Submitted 1 February, 2021; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Task Adaptive Pretraining of Transformers for Hostility Detection
Authors:
Tathagata Raha,
Sayar Ghosh Roy,
Ujwal Narayan,
Zubair Abid,
Vasudeva Varma
Abstract:
Identifying adverse and hostile content on the web, and more particularly on social media, has become a problem of paramount interest in recent years. With its ever-increasing popularity, fine-tuning of pretrained Transformer-based encoder models with a classifier head is gradually becoming the new baseline for natural language classification tasks. In our work, we explore the gains attributed to Task Adaptive Pretraining (TAPT) prior to fine-tuning of Transformer-based architectures. We specifically study two problems, namely, (a) Coarse binary classification of Hindi Tweets into Hostile or Not, and (b) Fine-grained multi-label classification of Tweets into four categories: hate, fake, offensive, and defamation. Building upon an architecture which takes emojis and segmented hashtags into consideration for classification, we are able to experimentally showcase the performance upgrades due to TAPT. Our system (with team name 'iREL IIIT') ranked first in the 'Hostile Post Detection in Hindi' shared task with an F1 score of 97.16% for coarse-grained detection and a weighted F1 score of 62.96% for fine-grained multi-label classification on the provided blind test corpora.
Submitted 9 January, 2021;
originally announced January 2021.
-
Leveraging Multilingual Transformers for Hate Speech Detection
Authors:
Sayar Ghosh Roy,
Ujwal Narayan,
Tathagata Raha,
Zubair Abid,
Vasudeva Varma
Abstract:
Detecting and classifying instances of hate in social media text has been a problem of interest in Natural Language Processing in recent years. Our work leverages state-of-the-art Transformer language models to identify hate speech in a multilingual setting. Capturing the intent of a post or a comment on social media involves careful evaluation of the language style, semantic content and additional pointers such as hashtags and emojis. In this paper, we look at the problem of identifying whether a Twitter post is hateful and offensive or not. We further discriminate the detected toxic content into one of the following three classes: (a) Hate Speech (HATE), (b) Offensive (OFFN) and (c) Profane (PRFN). With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages. On the provided testing corpora, we achieve Macro F1 scores of 90.29, 81.87 and 75.40 for English, German and Hindi respectively while performing hate speech detection and of 60.70, 53.28 and 49.74 during fine-grained classification. In our experiments, we show the efficacy of Perspective API features for hate speech classification and the effects of exploiting a multilingual training scheme. A feature selection study is provided to illustrate the impact of specific features upon the architecture's classification head.
Submitted 8 January, 2021;
originally announced January 2021.
-
A Bayesian Account of Measures of Interpretability in Human-AI Interaction
Authors:
Sarath Sreedharan,
Anagha Kulkarni,
Tathagata Chakraborti,
David E. Smith,
Subbarao Kambhampati
Abstract:
Existing approaches for the design of interpretable agent behavior consider different measures of interpretability in isolation. In this paper we posit that, in the design and deployment of human-aware agents in the real world, notions of interpretability are just some among many considerations; and the techniques developed in isolation lack two key properties to be useful when considered together: they need to be able to deal with 1) their mutually competing properties; and 2) an open world where the human is not just there to interpret behavior in one specific form. To this end, we consider three well-known instances of interpretable behavior studied in existing literature -- namely, explicability, legibility, and predictability -- and propose a revised model where all these behaviors can be meaningfully modeled together. We highlight interesting consequences of this unified model and motivate, through results of a user study, why this revision is necessary.
Submitted 21 November, 2020;
originally announced November 2020.
-
Explainable Composition of Aggregated Assistants
Authors:
Sarath Sreedharan,
Tathagata Chakraborti,
Yara Rizk,
Yasaman Khazaeni
Abstract:
A new design of an AI assistant that has become increasingly popular is that of an "aggregated assistant" -- realized as an orchestrated composition of several individual skills or agents that can each perform atomic tasks. In this paper, we will talk about the role of planning in the automated composition of such assistants and explore how concepts in automated planning can help to establish transparency of the inner workings of the assistant to the end-user.
Submitted 20 November, 2020;
originally announced November 2020.
-
Development of POS tagger for English-Bengali Code-Mixed data
Authors:
Tathagata Raha,
Sainik Kumar Mahata,
Dipankar Das,
Sivaji Bandyopadhyay
Abstract:
Code-mixed texts are widespread nowadays due to the advent of social media. Since these texts combine two languages to formulate a sentence, they give rise to various research problems related to Natural Language Processing. In this paper, we explore one such problem, namely, Part-of-Speech (POS) tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script. Our approach initially involves the collection and cleaning of English-Bengali code-mixed tweets. These tweets were used as a development dataset for building our system. The proposed system takes a modular approach that starts by tagging individual tokens with their respective languages and then passes them to different POS taggers, designed for different languages (English and Bengali, in our case). Tags given by the two systems are later joined together and the final result is then mapped to a universal POS tag set. Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.
Submitted 28 July, 2020;
originally announced July 2020.
-
From Robotic Process Automation to Intelligent Process Automation: Emerging Trends
Authors:
Tathagata Chakraborti,
Vatche Isahagian,
Rania Khalaf,
Yasaman Khazaeni,
Vinod Muthusamy,
Yara Rizk,
Merve Unuvar
Abstract:
In this survey, we study how recent advances in machine intelligence are disrupting the world of business processes. Over the last decade, there has been steady progress towards the automation of business processes under the umbrella of ``robotic process automation'' (RPA). However, we are currently at an inflection point in this evolution, as a new paradigm called ``Intelligent Process Automation'' (IPA) emerges, bringing machine learning (ML) and artificial intelligence (AI) technologies to bear in order to improve business process outcomes. The purpose of this paper is to provide a survey of this emerging theme and identify key open research challenges at the intersection of AI and business processes. We hope that this emerging theme will spark engaging conversations at the RPA Forum.
Submitted 26 July, 2020;
originally announced July 2020.
-
Designing Environments Conducive to Interpretable Robot Behavior
Authors:
Anagha Kulkarni,
Sarath Sreedharan,
Sarah Keren,
Tathagata Chakraborti,
David Smith,
Subbarao Kambhampati
Abstract:
Designing robots capable of generating interpretable behavior is a prerequisite for achieving effective human-robot collaboration. This means that the robots need to be capable of generating behavior that aligns with human expectations and, when required, provide explanations to the humans in the loop. However, exhibiting such behavior in arbitrary environments could be quite expensive for robots, and in some cases, the robot may not even be able to exhibit the expected behavior. Given structured environments (like warehouses and restaurants), it may be possible to design the environment so as to boost the interpretability of the robot's behavior or to shape the human's expectations of the robot's behavior. In this paper, we investigate the opportunities and limitations of environment design as a tool to promote a type of interpretable behavior -- known in the literature as explicable behavior. We formulate a novel environment design framework that considers design over multiple tasks and over a time horizon. In addition, we explore the longitudinal aspect of explicable behavior and the trade-off that arises between the cost of design and the cost of generating explicable behavior over a time horizon.
Submitted 2 August, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
The Emerging Landscape of Explainable AI Planning and Decision Making
Authors:
Tathagata Chakraborti,
Sarath Sreedharan,
Subbarao Kambhampati
Abstract:
In this paper, we provide a comprehensive outline of the different threads of work in Explainable AI Planning (XAIP) that has emerged as a focus area in the last couple of years and contrast that with earlier efforts in the field in terms of techniques, target users, and delivery mechanisms. We hope that the survey will provide guidance to new researchers in automated planning towards the role of explanations in the effective design of human-in-the-loop systems, as well as provide the established researcher with some perspective on the evolution of the exciting world of explainable planning.
Submitted 26 February, 2020;
originally announced February 2020.
-
Project CLAI: Instrumenting the Command Line as a New Environment for AI Agents
Authors:
Mayank Agarwal,
Jorge J. Barroso,
Tathagata Chakraborti,
Eli M. Dow,
Kshitij Fadnis,
Borja Godoy,
Madhavan Pallan,
Kartik Talamadupula
Abstract:
This whitepaper reports on Project CLAI (Command Line AI), which aims to bring the power of AI to the command line interface (CLI). The CLAI platform sets up the CLI as a new environment for AI researchers to conquer by surfacing the command line as a generic environment that researchers can interface to using a simple sense-act API, much like the traditional AI agent architecture. In this paper, we discuss the design and implementation of the platform in detail, through illustrative use cases of new end user interaction patterns enabled by this design, and through quantitative evaluation of the system footprint of a CLAI-enabled terminal. We also report on some early user feedback on CLAI's features from an internal survey.
Submitted 17 June, 2020; v1 submitted 31 January, 2020;
originally announced February 2020.
-
A Unified Conversational Assistant Framework for Business Process Automation
Authors:
Yara Rizk,
Abhishek Bhandwalder,
Scott Boag,
Tathagata Chakraborti,
Vatche Isahagian,
Yasaman Khazaeni,
Falk Pollock,
Merve Unuvar
Abstract:
Business process automation is a booming multi-billion-dollar industry that promises to remove menial tasks from workers' plates -- through the introduction of autonomous agents -- and free up their time and brain power for more creative and engaging tasks. However, an essential component to the successful deployment of such autonomous agents is the ability of business users to monitor their performance and customize their execution. A simple and user-friendly interface with a low learning curve is necessary to increase the adoption of such agents in banking, insurance, retail and other domains. As a result, proactive chatbots will play a crucial role in the business automation space. Not only can they respond to users' queries and perform actions on their behalf but also initiate communication with the users to inform them of the system's behavior. This will provide business users a natural language interface to interact with, monitor and control autonomous agents. In this work, we present a multi-agent orchestration framework to develop such proactive chatbots by discussing the types of skills that can be composed into agents and how to orchestrate these agents. Two use cases on a travel preapproval business process and a loan application business process are adopted to qualitatively analyze the proposed framework based on four criteria: performance, coding overhead, scalability, and agent overlap.
Submitted 7 January, 2020;
originally announced January 2020.
-
D3BA: A Tool for Optimizing Business Processes Using Non-Deterministic Planning
Authors:
Tathagata Chakraborti,
Yasaman Khazaeni
Abstract:
This paper builds upon recent work in the declarative design of dialogue agents and proposes an exciting new tool -- D3BA -- Declarative Design for Digital Business Automation, built to optimize business processes using the power of AI planning. The tool provides a powerful framework to build, optimize, and maintain complex business processes and optimize them by composing with services that automate one or more subtasks. We illustrate salient features of this composition technique, compare with other philosophies of composition, and highlight exciting opportunities for research in this emerging field of business process automation.
Submitted 4 February, 2020; v1 submitted 8 January, 2020;
originally announced January 2020.
-
A Generalizable Method for Automated Quality Control of Functional Neuroimaging Datasets
Authors:
Matthew Kollada,
Qingzhu Gao,
Monika S Mellem,
Tathagata Banerjee,
William J Martin
Abstract:
Over the last twenty-five years, advances in the collection and analysis of fMRI data have enabled new insights into the brain basis of human health and disease. Individual behavioral variation can now be visualized at a neural level as patterns of connectivity among brain regions. Functional brain imaging is enhancing our understanding of clinical psychiatric disorders by revealing ties between regional and network abnormalities and psychiatric symptoms. Initial success in this arena has recently motivated collection of larger datasets which are needed to leverage fMRI to generate brain-based biomarkers to support development of precision medicines. Despite methodological advances and enhanced computational power, evaluating the quality of fMRI scans remains a critical step in the analytical framework. Before analysis can be performed, expert reviewers visually inspect raw scans and preprocessed derivatives to determine viability of the data. This Quality Control (QC) process is labor intensive, and the inability to automate at large scale has proven to be a limiting factor in clinical neuroscience fMRI research. We present a novel method for automating the QC of fMRI scans. We train machine learning classifiers using features derived from brain MR images to predict the "quality" of those images, based on the ground truth of an expert's opinion. We emphasize the importance of these classifiers' ability to generalize their predictions across data from different studies. To address this, we propose a novel approach entitled "FMRI preprocessing Log mining for Automated, Generalizable Quality Control" (FLAG-QC), in which features derived from mining runtime logs are used to train the classifier. We show that classifiers trained on FLAG-QC features perform much better (AUC=0.79) than those trained on previously proposed feature sets (AUC=0.56) when testing their ability to generalize across studies.
Submitted 20 December, 2019;
originally announced December 2019.