-
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
Authors:
Devesh Pant,
Rishi Raj Grandhe,
Vipin Samaria,
Mukul Paul,
Sudhir Kumar,
Saransh Khanna,
Jatin Agrawal,
Jushaan Singh Kalra,
Akhil VSSG,
Satish V Khalikar,
Vipin Garg,
Himanshu Chauhan,
Pranay Verma,
Neha Khandelwal,
Soma S Dhavala,
Minesh Mathew
Abstract:
Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel. It is a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation and further dissemination to local agencies for timely intervention. Since April 2022, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.
Submitted 24 June, 2025;
originally announced June 2025.
-
MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers
Authors:
Jushaan Singh Kalra,
Xinran Zhao,
To Eun Kim,
Fengyu Cai,
Fernando Diaz,
Tongshuang Wu
Abstract:
Retrieval-augmented Generation (RAG) is powerful, but its effectiveness hinges on which retrievers we use and how. Different retrievers offer distinct, often complementary signals: BM25 captures lexical matches; dense retrievers, semantic similarity. Yet in practice, we typically fix a single retriever based on heuristics, which fails to generalize across diverse information needs. Can we dynamically select and integrate multiple retrievers for each individual query, without the need for manual selection? In our work, we validate this intuition with quantitative analysis and introduce mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers. Extensive experiments show that such mixtures are effective and efficient: Despite totaling just 0.8B parameters, this mixture outperforms every individual retriever and even larger 7B models by +10.8% and +3.9% on average, respectively. Further analysis also shows that this mixture framework can help incorporate specialized non-oracle human information sources as retrievers to achieve good collaboration, with a 58.9% relative performance improvement over simulated humans alone.
Submitted 18 June, 2025;
originally announced June 2025.
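The weighted combination of heterogeneous retrievers described in the abstract above can be illustrated with a minimal sketch. This is not the authors' method: the function names, the min-max normalization step, and the fixed weights are all assumptions made for illustration, and the abstract does not say how the per-query weights are actually obtained.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tie; give each the same normalized score.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def mix_retrievers(
    per_retriever_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Combine normalized scores as a weighted sum and rank documents.

    `weights` is hypothetical: here it is supplied by the caller, whereas a
    real system would have to estimate it for each query.
    """
    combined: Dict[str, float] = {}
    for name, scores in per_retriever_scores.items():
        w = weights.get(name, 0.0)
        for doc, s in min_max_normalize(scores).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for a single query (made-up values, not from the paper).
toy_scores = {
    "sparse": {"d1": 10.0, "d2": 5.0, "d3": 0.0},  # e.g. BM25-style scores
    "dense": {"d2": 0.9, "d3": 0.5, "d1": 0.1},    # e.g. cosine similarities
}
ranked = mix_retrievers(toy_scores, {"sparse": 0.5, "dense": 0.5})
print(ranked)
```

With equal weights, a document that scores moderately well under both retrievers (here `d2`) can outrank one that only a single retriever favors, which is the complementary-signal effect the abstract refers to.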
-
Implications of Annotation Artifacts in Edge Probing Test Datasets
Authors:
Sagnik Ray Choudhury,
Jushaan Kalra
Abstract:
Edge probing (EP) tests are classification tasks that test for grammatical knowledge encoded in token representations coming from contextual encoders such as large language models (LLMs). Many LLM encoders have shown high performance in EP tests, leading to conjectures about their ability to encode linguistic knowledge. However, a large body of research claims that the tests do not necessarily measure the LLM's capacity to encode knowledge, but rather reflect the classifiers' ability to learn the problem. Much of this criticism stems from the fact that the classifiers often have very similar accuracy whether an LLM or a random encoder is used. Consequently, several modifications to the tests have been suggested, including information-theoretic probes. We show that commonly used edge probing test datasets have various biases, including memorization. When these biases are removed, the LLM encoders do show a significant difference from the random ones, even with the simple non-information-theoretic probes.
Submitted 20 October, 2023;
originally announced October 2023.
-
Battling Hateful Content in Indic Languages HASOC '21
Authors:
Aditya Kadam,
Anmol Goel,
Jivitesh Jain,
Jushaan Singh Kalra,
Mallika Subramanian,
Manvith Reddy,
Prashant Kodali,
T. H. Arjun,
Manish Shrivastava,
Ponnurangam Kumaraguru
Abstract:
The extensive rise in consumption of online social media (OSMs) by a large number of people makes curbing the spread of hateful content on these platforms a critical problem. With the growing usage of OSMs in multiple languages, the task of detecting and characterizing hate becomes more complex. The subtle variations of code-mixed texts along with switching scripts only add to the complexity. This paper presents a solution for the HASOC 2021 Multilingual Twitter Hate-Speech Detection challenge by team PreCog IIIT Hyderabad. We adopt a multilingual transformer-based approach and describe our architecture for all 6 subtasks of the challenge. Out of the 6 teams that participated in all the subtasks, our submissions rank 3rd overall.
Submitted 5 November, 2021; v1 submitted 25 October, 2021;
originally announced October 2021.
-
Hierarchical Clustering of World Cuisines
Authors:
Tript Sharma,
Utkarsh Upadhyay,
Jushaan Kalra,
Sakshi Arora,
Saad Ahmad,
Bhavay Aggarwal,
Ganesh Bagler
Abstract:
Cultures across the world have evolved to have unique patterns despite shared ingredients and cooking techniques. Using data obtained from RecipeDB, an online resource for recipes, we extract patterns in 26 world cuisines and further probe for their inter-relatedness. By applying frequent itemset mining and ingredient authenticity, we characterize the quintessential patterns in the cuisines and build a hierarchical tree of the world cuisines. This tree provides interesting insights into the evolution of cuisines and their geographical as well as historical relatedness.
Submitted 25 April, 2020;
originally announced April 2020.
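As a rough illustration of the frequent itemset mining mentioned in the abstract above, the sketch below counts ingredient combinations that appear in at least a given fraction of recipes. It is only an assumption-laden toy: the recipes are made up rather than taken from RecipeDB, the function name and parameters are hypothetical, and it enumerates small itemsets by brute force instead of using a pruning algorithm such as Apriori, which the abstract does not specify either way.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def frequent_itemsets(
    recipes: List[Iterable[str]],
    min_support: float,
    max_size: int = 2,
) -> Dict[Tuple[str, ...], float]:
    """Return itemsets of up to `max_size` ingredients with support >= min_support.

    Support is the fraction of recipes that contain every ingredient in the
    itemset. Itemsets are represented as alphabetically sorted tuples.
    """
    n = len(recipes)
    if n == 0:
        return {}
    counts: Counter = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # de-duplicate and fix a canonical order
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Made-up recipes for one hypothetical cuisine.
toy_recipes = [
    ["tomato", "basil", "olive oil"],
    ["tomato", "basil"],
    ["tomato", "garlic"],
    ["basil", "olive oil"],
]
patterns = frequent_itemsets(toy_recipes, min_support=0.5)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Per-cuisine pattern sets of this kind could then serve as features for comparing cuisines, though how the paper turns them into a hierarchical tree is not stated in the abstract.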