-
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
Authors:
Devesh Pant,
Rishi Raj Grandhe,
Vipin Samaria,
Mukul Paul,
Sudhir Kumar,
Saransh Khanna,
Jatin Agrawal,
Jushaan Singh Kalra,
Akhil VSSG,
Satish V Khalikar,
Vipin Garg,
Himanshu Chauhan,
Pranay Verma,
Neha Khandelwal,
Soma S Dhavala,
Minesh Mathew
Abstract:
Early detection of disease outbreaks is crucial to ensure timely intervention by health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, given the volume of online articles published every day, manual screening is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation, and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by the public health experts at NCDC as potential outbreaks.
Submitted 24 June, 2025;
originally announced June 2025.
-
MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers
Authors:
Jushaan Singh Kalra,
Xinran Zhao,
To Eun Kim,
Fengyu Cai,
Fernando Diaz,
Tongshuang Wu
Abstract:
Retrieval-augmented Generation (RAG) is powerful, but its effectiveness hinges on which retrievers we use and how. Different retrievers offer distinct, often complementary signals: BM25 captures lexical matches; dense retrievers, semantic similarity. Yet in practice, we typically fix a single retriever based on heuristics, which fails to generalize across diverse information needs. Can we dynamically select and integrate multiple retrievers for each individual query, without the need for manual selection? In our work, we validate this intuition with quantitative analysis and introduce mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers. Extensive experiments show that such mixtures are effective and efficient: Despite totaling just 0.8B parameters, this mixture outperforms every individual retriever and even larger 7B models by +10.8% and +3.9% on average, respectively. Further analysis also shows that this mixture framework can help incorporate specialized non-oracle human information sources as retrievers to achieve good collaboration, with a 58.9% relative performance improvement over simulated humans alone.
Submitted 18 June, 2025;
originally announced June 2025.
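The weighted combination of heterogeneous retrievers described in the MoR abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the min-max normalization step, and the fixed weights are all assumptions made for illustration, and the abstract does not say how the per-query weights are actually derived.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tie; give each the same normalized score.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def mix_retrievers(
    per_retriever_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Fuse document scores as a weighted sum over retrievers.

    `weights` is taken as given here (an assumption); a retriever
    missing from `weights` contributes nothing.
    """
    fused: Dict[str, float] = {}
    for name, scores in per_retriever_scores.items():
        w = weights.get(name, 0.0)
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores (made up): a lexical retriever and a dense retriever.
bm25 = {"d1": 12.0, "d2": 4.0, "d3": 8.0}
dense = {"d2": 0.9, "d3": 0.7, "d4": 0.1}
ranking = mix_retrievers({"bm25": bm25, "dense": dense},
                         {"bm25": 0.4, "dense": 0.6})
print([doc for doc, _ in ranking])
```

In this toy case a document that both retrievers score moderately well (`d3`) ends up ahead of documents that only one retriever favors, which is the kind of complementary signal the abstract refers to.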
-
Implications of Annotation Artifacts in Edge Probing Test Datasets
Authors:
Sagnik Ray Choudhury,
Jushaan Kalra
Abstract:
Edge probing (EP) tests are classification tasks that test for grammatical knowledge encoded in token representations coming from contextual encoders such as large language models (LLMs). Many LLM encoders have shown high performance in EP tests, leading to conjectures about their ability to encode linguistic knowledge. However, a large body of research claims that the tests do not necessarily measure the LLM's capacity to encode knowledge, but rather reflect the classifiers' ability to learn the problem. Much of this criticism stems from the fact that the classifiers often have very similar accuracy whether an LLM or a random encoder is used. Consequently, several modifications to the tests have been suggested, including information-theoretic probes. We show that commonly used edge probing test datasets have various biases, including memorization. When these biases are removed, the LLM encoders do show a significant difference from the random ones, even with the simple non-information-theoretic probes.
Submitted 20 October, 2023;
originally announced October 2023.
-
Schrödinger Spectrum based Continuous Cuff-less Blood Pressure Estimation using Clinically Relevant Features from PPG Signal and its Second Derivative
Authors:
Aayushman Ghosh,
Sayan Sarkar,
Jayant Kalra
Abstract:
This study aims to estimate blood pressure (BP) from photoplethysmogram (PPG) signals using multiple machine learning models. It proposes a novel algorithm for signal reconstruction based on the semi-classical signal analysis (SCSA) technique. The proposed algorithm optimises the semi-classical constant and eliminates the trade-off between complexity and accuracy in reconstruction. Spectral features of the reconstructed signals are extracted and combined with clinically relevant morphological features of the PPG and its second derivative (SDPPG). The developed method was assessed using a publicly available virtual in-silico dataset with more than 4000 subjects, and the Multi-Parameter Intelligent Monitoring in Intensive Care Units dataset. Results showed that the method attained mean absolute errors of 5.37 and 2.96 mmHg for systolic and diastolic BP, respectively, using the CatBoost supervisory algorithm. This approach met the standards set by the Association for the Advancement of Medical Instrumentation, and achieved Grade A for all BP categories in the British Hypertension Society protocol. The proposed framework performs well even when applied to a combined database of the MIMIC-III and the Queensland dataset. This study also evaluates the proposed method's performance in a non-clinical setting with noisy and deformed PPG signals, to validate the efficacy of the SCSA method. The noise stress tests showed that the algorithm maintained its key feature detection, signal reconstruction capability, and estimation accuracy up to a 10 dB signal-to-noise ratio (SNR). It is believed that the proposed cuff-less BP estimation technique has the potential to perform well in resource-constrained settings due to its straightforward implementation approach.
Submitted 20 January, 2023;
originally announced January 2023.
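The two evaluation criteria named in the abstract above, mean absolute error and the British Hypertension Society (BHS) grade, can be computed as in the following sketch. The BHS thresholds are the commonly cited cumulative percentages of absolute errors within 5, 10, and 15 mmHg; the function names and the sample readings are hypothetical and are not taken from the paper.

```python
from typing import Sequence

# Minimum cumulative percentage of readings with absolute error
# within 5, 10, and 15 mmHg for each BHS grade.
BHS_THRESHOLDS = {
    "A": (60.0, 85.0, 95.0),
    "B": (50.0, 75.0, 90.0),
    "C": (40.0, 65.0, 85.0),
}


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and estimated BP (mmHg)."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def bhs_grade(y_true: Sequence[float], y_pred: Sequence[float]) -> str:
    """Return the best BHS grade whose three thresholds are all met, else 'D'."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    pct = [100.0 * sum(e <= limit for e in abs_err) / n for limit in (5, 10, 15)]
    for grade, minimums in BHS_THRESHOLDS.items():  # checked in order A, B, C
        if all(p >= m for p, m in zip(pct, minimums)):
            return grade
    return "D"


# Made-up reference and estimated systolic readings, for illustration only.
reference = [120, 130, 110, 140, 125]
estimated = [122, 127, 118, 139, 125]
print(mean_absolute_error(reference, estimated), bhs_grade(reference, estimated))
```

A grade is assigned per BP category, so in practice the same functions would be called separately on the systolic and diastolic series.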
-
Battling Hateful Content in Indic Languages HASOC '21
Authors:
Aditya Kadam,
Anmol Goel,
Jivitesh Jain,
Jushaan Singh Kalra,
Mallika Subramanian,
Manvith Reddy,
Prashant Kodali,
T. H. Arjun,
Manish Shrivastava,
Ponnurangam Kumaraguru
Abstract:
The extensive rise in the use of online social media (OSMs) by a large number of people makes curbing the spread of hateful content on these platforms a critical problem. With the growing usage of OSMs in multiple languages, the task of detecting and characterizing hate becomes more complex. The subtle variations of code-mixed texts, along with script switching, only add to the complexity. This paper presents a solution for the HASOC 2021 Multilingual Twitter Hate-Speech Detection challenge by team PreCog IIIT Hyderabad. We adopt a multilingual transformer-based approach and describe our architecture for all 6 subtasks of the challenge. Out of the 6 teams that participated in all the subtasks, our submissions rank 3rd overall.
Submitted 5 November, 2021; v1 submitted 25 October, 2021;
originally announced October 2021.
-
Nutritional Profile Estimation in Cooking Recipes
Authors:
Jushaan Kalra,
Devansh Batra,
Nirav Diwan,
Ganesh Bagler
Abstract:
An accurate nutritional profile for each recipe is an important feature for food databases, with applications including nutritional assistance, recommendation systems, and dietary analytics. Online databases often gather recipes from diverse sources in an attempt to maximize the number and variety of recipes in the dataset. This leads to an incomplete and often unreliable set of nutritional details. We propose a scalable method for estimating the nutritional profile of recipes from their ingredients section, using a standard reliable database for the nutritional values. Previous studies have attested to the efficiency of string-matching methods on small datasets. To demonstrate the effectiveness of our procedure, we apply the proposed method to a large dataset, RecipeDB, which contains recipes from multiple data sources, using the United States Department of Agriculture Standard Reference (USDA-SR) Database as a reference for computing nutritional profiles. We evaluate our method by calculating the average error across our database of recipes (36 calories per serving), which is well within the range of errors attributable to physical variations.
Submitted 26 April, 2020;
originally announced April 2020.
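The general idea in the abstract above, matching ingredient strings against a reference table and summing per-ingredient contributions, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `difflib` for fuzzy matching rather than the paper's matching procedure, the function names are hypothetical, and the reference values are placeholders, not USDA-SR figures.

```python
import difflib
from typing import Dict, List, Optional, Tuple

# Hypothetical reference table: kcal per 100 g.
# Placeholder numbers for illustration only, NOT USDA-SR values.
REFERENCE_KCAL_PER_100G: Dict[str, float] = {
    "butter": 700.0,
    "wheat flour": 350.0,
    "white sugar": 400.0,
}


def match_ingredient(name: str, reference: Dict[str, float],
                     cutoff: float = 0.6) -> Optional[str]:
    """Return the closest reference entry for an ingredient name, if any."""
    hits = difflib.get_close_matches(name.lower().strip(), list(reference),
                                     n=1, cutoff=cutoff)
    return hits[0] if hits else None


def estimate_calories(ingredients: List[Tuple[str, float]],
                      reference: Dict[str, float],
                      servings: int) -> Tuple[float, List[str]]:
    """Estimate kcal per serving from (name, grams) pairs.

    Ingredients with no sufficiently close reference entry are skipped
    and reported back so the gap is visible rather than silently ignored.
    """
    total = 0.0
    unmatched: List[str] = []
    for name, grams in ingredients:
        key = match_ingredient(name, reference)
        if key is None:
            unmatched.append(name)
            continue
        total += reference[key] * grams / 100.0
    return total / servings, unmatched


recipe = [("Butter", 50.0), ("wheat flour ", 200.0), ("saffron", 1.0)]
per_serving, missing = estimate_calories(recipe, REFERENCE_KCAL_PER_100G, servings=4)
print(per_serving, missing)
```

Returning the unmatched ingredients alongside the estimate makes it easy to audit how much of a recipe the reference table actually covered.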
-
Hierarchical Clustering of World Cuisines
Authors:
Tript Sharma,
Utkarsh Upadhyay,
Jushaan Kalra,
Sakshi Arora,
Saad Ahmad,
Bhavay Aggarwal,
Ganesh Bagler
Abstract:
Cultures across the world have evolved to have unique patterns despite shared ingredients and cooking techniques. Using data obtained from RecipeDB, an online resource for recipes, we extract patterns in 26 world cuisines and further probe for their inter-relatedness. By application of frequent itemset mining and ingredient authenticity we characterize the quintessential patterns in the cuisines and build a hierarchical tree of the world cuisines. This tree provides interesting insights into the evolution of cuisines and their geographical as well as historical relatedness.
Submitted 25 April, 2020;
originally announced April 2020.
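Frequent itemset mining, which the abstract above applies to cuisine recipes, can be illustrated with a brute-force sketch that counts how often each ingredient and ingredient pair occurs. It assumes a simple exhaustive count rather than an optimized algorithm such as Apriori or FP-Growth; the abstract does not say which algorithm was used, and the function name and toy recipes are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def frequent_itemsets(recipes: Iterable[Set[str]], min_support: float,
                      max_size: int = 2) -> Dict[Tuple[str, ...], float]:
    """Return itemsets (up to max_size ingredients) with support >= min_support.

    Support is the fraction of recipes that contain every item in the set.
    """
    recipes = list(recipes)
    counts: Counter = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # sorted so each itemset has one canonical form
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    n = len(recipes)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


# Toy recipes (made up), each reduced to its set of ingredients.
toy_recipes = [
    {"garlic", "onion", "tomato"},
    {"garlic", "onion"},
    {"garlic", "soy sauce"},
    {"onion", "tomato"},
]
patterns = frequent_itemsets(toy_recipes, min_support=0.5)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Running this per cuisine yields one pattern set per cuisine, which is the kind of input a subsequent hierarchical clustering step could compare.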