-
Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography
Authors:
Abdur Rahman,
Keerthiveena Balraj,
Manojkumar Ramteke,
Anurag Singh Rathore
Abstract:
Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby compl…
▽ More
Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
A Foundation Model for Spatial Proteomics
Authors:
Muhammad Shaban,
Yuzhou Chang,
Huaying Qiu,
Yao Yu Yeo,
Andrew H. Song,
Guillaume Jaume,
Yuchen Wang,
Luca L. Weishaupt,
Tong Ding,
Anurag Vaidya,
Abdallah Lamane,
Daniel Shao,
Mohammed Zidane,
Yunhao Bai,
Paige McCallum,
Shuli Luo,
Wenrui Wu,
Yang Wang,
Precious Cramer,
Chi Ngai Chan,
Pierre Stephan,
Johanna Schaffenrath,
Jia Le Lee,
Hendrik A. Michel,
Caiwei Tian
, et al. (35 additional authors not shown)
Abstract:
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-superv…
▽ More
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Continual Learning in Vision-Language Models via Aligned Model Merging
Authors:
Ghada Sokar,
Gintare Karolina Dziugaite,
Anurag Arnab,
Ahmet Iscen,
Pablo Samuel Castro,
Cordelia Schmid
Abstract:
Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based…
▽ More
Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
Multilingual Information Retrieval with a Monolingual Knowledge Base
Authors:
Yingying Zhuang,
Aman Gupta,
Anurag Beniwal
Abstract:
Multilingual information retrieval has emerged as powerful tools for expanding knowledge sharing across languages. On the other hand, resources on high quality knowledge base are often scarce and in limited languages, therefore an effective embedding model to transform sentences from different languages into a feature vector space same as the knowledge base language becomes the key ingredient for…
▽ More
Multilingual information retrieval has emerged as powerful tools for expanding knowledge sharing across languages. On the other hand, resources on high quality knowledge base are often scarce and in limited languages, therefore an effective embedding model to transform sentences from different languages into a feature vector space same as the knowledge base language becomes the key ingredient for cross language knowledge sharing, especially to transfer knowledge available in high-resource languages to low-resource ones. In this paper we propose a novel strategy to fine-tune multilingual embedding models with weighted sampling for contrastive learning, enabling multilingual information retrieval with a monolingual knowledge base. We demonstrate that the weighted sampling strategy produces performance gains compared to standard ones by up to 31.03\% in MRR and up to 33.98\% in Recall@3. Additionally, our proposed methodology is language agnostic and applicable for both multilingual and code switching use cases.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Lessons Learned from the URGENT 2024 Speech Enhancement Challenge
Authors:
Wangyou Zhang,
Kohei Saijo,
Samuele Cornell,
Robin Scheibler,
Chenda Li,
Zhaoheng Ni,
Anurag Kumar,
Marvin Sach,
Wei Wang,
Yihui Fu,
Shinji Watanabe,
Tim Fingscheidt,
Yanmin Qian
Abstract:
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Nourished by the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and…
▽ More
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Nourished by the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) lack of both effective SE systems to conquer the hardest conditions (e.g., speech overlap, strong noise / reverberation) and reliable measure of speech sample difficulty; (3) importance of combining multifaceted metrics for a comprehensive evaluation correlating well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Legal Compliance Evaluation of Smart Contracts Generated By Large Language Models
Authors:
Chanuka Wijayakoon,
Hai Dong,
H. M. N. Dilum Bandara,
Zahir Tari,
Anurag Soin
Abstract:
Smart contracts can implement and automate parts of legal contracts, but ensuring their legal compliance remains challenging. Existing approaches such as formal specification, verification, and model-based development require expertise in both legal and software development domains, as well as extensive manual effort. Given the recent advances of Large Language Models (LLMs) in code generation, we…
▽ More
Smart contracts can implement and automate parts of legal contracts, but ensuring their legal compliance remains challenging. Existing approaches such as formal specification, verification, and model-based development require expertise in both legal and software development domains, as well as extensive manual effort. Given the recent advances of Large Language Models (LLMs) in code generation, we investigate their ability to generate legally compliant smart contracts directly from natural language legal contracts, addressing these challenges. We propose a novel suite of metrics to quantify legal compliance based on modeling both legal and smart contracts as processes and comparing their behaviors. We select four LLMs, generate 20 smart contracts based on five legal contracts, and analyze their legal compliance. We find that while all LLMs generate syntactically correct code, there is significant variance in their legal compliance with larger models generally showing higher levels of compliance. We also evaluate the proposed metrics against properties of software metrics, showing they provide fine-grained distinctions, enable nuanced comparisons, and are applicable across domains for code from any source, LLM or developer. Our results suggest that LLMs can assist in generating starter code for legally compliant smart contracts with strict reviews, and the proposed metrics provide a foundation for automated and self-improving development workflows.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
REIC: RAG-Enhanced Intent Classification at Scale
Authors:
Ziji Zhang,
Michael Yang,
Zhiyu Chen,
Yingying Zhuang,
Shu-Ting Pi,
Qun Liu,
Rajashekar Maragoud,
Vy Nguyen,
Anurag Beniwal
Abstract:
Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In…
▽ More
Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training
Authors:
Shriram M S,
Xinyue Hao,
Shihao Hou,
Yang Lu,
Laura Sevilla-Lara,
Anurag Arnab,
Shreyank N Gowda
Abstract:
The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising…
▽ More
The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that can become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. This savings actually do not come at any cost for accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
△ Less
Submitted 6 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Authors:
Sahana Ramnath,
Anurag Mudgil,
Brihi Joshi,
Skyler Hallinan,
Xiang Ren
Abstract:
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We prese…
▽ More
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter's significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Mechanistic Interpretability of GPT-like Models on Summarization Tasks
Authors:
Anurag Mishra
Abstract:
Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in a…
▽ More
Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development
Authors:
Ming Shen,
Raphael Shu,
Anurag Pratik,
James Gung,
Yubin Ge,
Monica Sunkara,
Yi Zhang
Abstract:
We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challe…
▽ More
We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments
Authors:
Aniket Bhattacharyya,
Anurag Tripathi,
Ujjal Das,
Archan Karmakar,
Amit Pathak,
Maneesh Gupta
Abstract:
Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied task. Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information to label sequences/tokens as named entities or answers to specific questions. However, these approaches lack reasoning, are not able to i…
▽ More
Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied task. Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information to label sequences/tokens as named entities or answers to specific questions. However, these approaches lack reasoning, are not able to infer values not explicitly present in documents, and do not generalize well to new formats. Generative LLM-based approaches proposed recently are capable of reasoning, but struggle to comprehend clues from document layout especially in previously unseen document formats, and do not show competitive performance in heterogeneous VRD benchmark datasets. In this paper, we propose BLOCKIE, a novel LLM-based approach that organizes VRDs into localized, reusable semantic textual segments called $\textit{semantic blocks}$, which are processed independently. Through focused and more generalizable reasoning,our approach outperforms the state-of-the-art on public VRD benchmarks by 1-3% in F1 scores, is resilient to document formats previously not encountered and shows abilities to correctly extract information not explicitly present in documents.
△ Less
Submitted 18 May, 2025;
originally announced May 2025.
-
Learning to Highlight Audio by Watching Movies
Authors:
Chao Huang,
Ruohan Gao,
J. M. F. Tsang,
Jan Kurcius,
Cagdas Bilen,
Chenliang Xu,
Anurag Kumar,
Sanjeel Parekh
Abstract:
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often…
▽ More
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Authors:
Xiaomin Li,
Zhou Yu,
Zhiwei Zhang,
Xupeng Chen,
Ziji Zhang,
Yingying Zhuang,
Narayanan Sadagopan,
Anurag Beniwal
Abstract:
Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: I…
▽ More
Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.
△ Less
Submitted 20 May, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
The generalized trifference problem
Authors:
Anurag Bishnoi,
Bartłomiej Kielak,
Benedek Kovács,
Zoltán Lóránt Nagy,
Gábor Somlai,
Máté Vizer,
Zeyu Zheng
Abstract:
We study the problem of finding the largest number $T(n, m)$ of ternary vectors of length $n$ such that for any three distinct vectors there are at least $m$ coordinates where they pairwise differ.
For $m = 1$, this is the classical trifference problem which is wide open.
We prove upper and lower bounds on $T(n, m)$ for various ranges of the parameter $m$ and determine the phase transition thr…
▽ More
We study the problem of finding the largest number $T(n, m)$ of ternary vectors of length $n$ such that for any three distinct vectors there are at least $m$ coordinates where they pairwise differ.
For $m = 1$, this is the classical trifference problem which is wide open.
We prove upper and lower bounds on $T(n, m)$ for various ranges of the parameter $m$ and determine the phase transition threshold on $m=m(n)$ where $T(n, m)$ jumps from constant to exponential in $n$.
By relating the linear version of this problem to a problem on blocking sets in finite geometry, we give explicit constructions and probabilistic lower bounds.
We also compute the exact values of this function and its linear variation for small parameters.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Separating Two Points with Obstacles in the Plane: Improved Upper and Lower Bounds
Authors:
Jack Spalding-Jamieson,
Anurag Murty Naredla
Abstract:
Given two points in the plane, and a set of "obstacles" given as curves through the plane with assigned weights, we consider the point-separation problem, which asks for the minimum-weight subset of the obstacles separating the two points. A few computational models for this problem have been previously studied. We give a unified approach to this problem in all models via a reduction to a particul…
▽ More
Given two points in the plane, and a set of "obstacles" given as curves through the plane with assigned weights, we consider the point-separation problem, which asks for the minimum-weight subset of the obstacles separating the two points. A few computational models for this problem have been previously studied. We give a unified approach to this problem in all models via a reduction to a particular shortest-path problem, and obtain improved running times in essentially all cases. In addition, we also give fine-grained lower bounds for many cases.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
Authors:
Khiem Vuong,
Anurag Ghosh,
Deva Ramanan,
Srinivasa Narasimhan,
Shubham Tulsiani
Abstract:
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult t…
▽ More
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Trajectory Adaptation using Large Language Models
Authors:
Anurag Maurya,
Tashmoy Ghosh,
Ravi Prakash
Abstract:
Adapting robot trajectories based on human instructions as per new situations is essential for achieving more intuitive and scalable human-robot interactions. This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners like RRT, A-star, etc, or learned from human demonstrations. We utilize pre-trained LLMs to adapt trajecto…
▽ More
Adapting robot trajectories based on human instructions as per new situations is essential for achieving more intuitive and scalable human-robot interactions. This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners like RRT, A-star, etc, or learned from human demonstrations. We utilize pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation, enabling more complex and flexible instructions than current methods. This approach allows us to incorporate a broader range of commands, including numerical inputs. Compared to state-of-the-art feature-based sequence-to-sequence models which require training, our method does not require task-specific training and offers greater interpretability and more effective feedback mechanisms. We validate our approach through simulation experiments on the robotic manipulator, aerial vehicle, and ground robot in the Pybullet and Gazebo simulation environments, demonstrating that LLMs can successfully adapt trajectories to complex human instructions.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Hearing Anywhere in Any Environment
Authors:
Xiulong Liu,
Anurag Kumar,
Paul Calamia,
Sebastia V. Amengual,
Calvin Murdock,
Ishwarya Ananthabhotla,
Philip Robinson,
Eli Shlizerman,
Vamsi Krishna Ithapu,
Ruohan Gao
Abstract:
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geomet…
▽ More
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
△ Less
Submitted 4 June, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
High-Performance Gradient Evaluation for Complex Soft Materials Using MPI-based DFS Algorithm
Authors:
Anurag Bhattacharyya
Abstract:
This article presents a depth-first search (DFS)-based algorithm for evaluating sensitivity gradients in the topology optimization of soft materials exhibiting complex deformation behavior. The algorithm is formulated using a time-dependent adjoint sensitivity approach and is implemented within a PETSc-based C++ MPI framework for efficient parallel computing. It has been found that on a single pro…
▽ More
This article presents a depth-first search (DFS)-based algorithm for evaluating sensitivity gradients in the topology optimization of soft materials exhibiting complex deformation behavior. The algorithm is formulated using a time-dependent adjoint sensitivity approach and is implemented within a PETSc-based C++ MPI framework for efficient parallel computing. It has been found that on a single processor, the sensitivity analysis for these complex materials can take approximately 45 minutes. This necessitates the use of high-performance computing (HPC) to achieve feasible optimization times. This work provides insights into the algorithmic framework and its application to large-scale generative design for physics integrated simulation of soft materials under complex loading conditions.
△ Less
Submitted 18 March, 2025;
originally announced April 2025.
-
LiDAR-based Object Detection with Real-time Voice Specifications
Authors:
Anurag Kulkarni
Abstract:
This paper presents a LiDAR-based object detection system with real-time voice specifications, integrating KITTI's 3D point clouds and RGB images through a multi-modal PointNet framework. It achieves 87.0% validation accuracy on a 3000-sample subset, surpassing a 200-sample baseline of 67.5% by combining spatial and visual data, addressing class imbalance with weighted loss, and refining training…
▽ More
This paper presents a LiDAR-based object detection system with real-time voice specifications, integrating KITTI's 3D point clouds and RGB images through a multi-modal PointNet framework. It achieves 87.0% validation accuracy on a 3000-sample subset, surpassing a 200-sample baseline of 67.5% by combining spatial and visual data, addressing class imbalance with weighted loss, and refining training via adaptive techniques. A Tkinter prototype provides natural Indian male voice output using Edge TTS (en-IN-PrabhatNeural), alongside 3D visualizations and real-time feedback, enhancing accessibility and safety in autonomous navigation, assistive technology, and beyond. The study offers a detailed methodology, comprehensive experimental analysis, and a broad review of applications and challenges, establishing this work as a scalable advancement in human-computer interaction and environmental perception, aligned with current research trends.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Contrasting Low and High-Resolution Features for HER2 Scoring using Deep Learning
Authors:
Ekansh Chauhan,
Anila Sharma,
Amit Sharma,
Vikas Nishadham,
Asha Ghughtyal,
Ankur Kumar,
Gurudutt Gupta,
Anurag Mehta,
C. V. Jawahar,
P. K. Vinod
Abstract:
Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists' expertise, making it labor-intensive and subject to significant inter-observer variability. To ad…
▽ More
Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists' expertise, making it labor-intensive and subject to significant inter-observer variability. To address these challenges, this study introduces the India Pathology Breast Cancer Dataset (IPD-Breast), comprising of 1,272 IHC slides (HER2, ER, and PR) aimed at automating receptor status classification. The primary focus is on developing predictive models for HER2 3-way classification (0, Low, High) to enhance prognosis. Evaluation of multiple deep learning models revealed that an end-to-end ConvNeXt network utilizing low-resolution IHC images achieved an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56%, respectively, for 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. This study highlights the potential of simple yet effective deep learning techniques to significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Data to Decisions: A Computational Framework to Identify skill requirements from Advertorial Data
Authors:
Aakash Singh,
Anurag Kanaujia,
Vivek Kumar Singh
Abstract:
Among the factors of production, human capital or skilled manpower is the one that keeps evolving and adapts to changing conditions and resources. This adaptability makes human capital the most crucial factor in ensuring a sustainable growth of industry/sector. As new technologies are developed and adopted, the new generations are required to acquire skills in newer technologies in order to be emp…
▽ More
Among the factors of production, human capital or skilled manpower is the one that keeps evolving and adapts to changing conditions and resources. This adaptability makes human capital the most crucial factor in ensuring a sustainable growth of industry/sector. As new technologies are developed and adopted, the new generations are required to acquire skills in newer technologies in order to be employable. At the same time professionals are required to upskill and reskill themselves to remain relevant in the industry. There is however no straightforward method to identify the skill needs of the industry at a given point of time. Therefore, this paper proposes a data to decision framework that can successfully identify the desired skill set in a given area by analysing the advertorial data collected from popular online job portals and supplied as input to the framework. The proposed framework uses techniques of statistical analysis, data mining and natural language processing for the purpose. The applicability of the framework is demonstrated on CS&IT job advertisement data from India. The analytical results not only provide useful insights about current state of skill needs in CS&IT industry but also provide practical implications to prospective job applicants, training agencies, and institutions of higher education & professional training.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
Truthful Elicitation of Imprecise Forecasts
Authors:
Anurag Singh,
Siu Lun Chau,
Krikamol Muandet
Abstract:
The quality of probabilistic forecasts is crucial for decision-making under uncertainty. While proper scoring rules incentivize truthful reporting of precise forecasts, they fall short when forecasters face epistemic uncertainty about their beliefs, limiting their use in safety-critical domains where decision-makers (DMs) prioritize proper uncertainty management. To address this, we propose a fram…
▽ More
The quality of probabilistic forecasts is crucial for decision-making under uncertainty. While proper scoring rules incentivize truthful reporting of precise forecasts, they fall short when forecasters face epistemic uncertainty about their beliefs, limiting their use in safety-critical domains where decision-makers (DMs) prioritize proper uncertainty management. To address this, we propose a framework for scoring imprecise forecasts -- forecasts given as a set of beliefs. Despite existing impossibility results for deterministic scoring rules, we enable truthful elicitation by drawing connection to social choice theory and introducing a two-way communication framework where DMs first share their aggregation rules (e.g., averaging or min-max) used in downstream decisions for resolving forecast ambiguity. This, in turn, helps forecasters resolve indecision during elicitation. We further show that truthful elicitation of imprecise forecasts is achievable using proper scoring rules randomized over the aggregation procedure. Our approach allows DM to elicit and integrate the forecaster's epistemic uncertainty into their decision-making process, thus improving credibility.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
Zero-to-One IDV: A Conceptual Model for AI-Powered Identity Verification
Authors:
Aniket Vaidya,
Anurag Awasthi
Abstract:
In today's increasingly digital interactions, robust Identity Verification (IDV) is crucial for security and trust. Artificial Intelligence (AI) is transforming IDV, enhancing accuracy and fraud detection. This paper introduces ``Zero to One,'' a holistic conceptual framework for developing AI-powered IDV products. This paper outlines the foundational problem and research objectives that necessita…
▽ More
In today's increasingly digital interactions, robust Identity Verification (IDV) is crucial for security and trust. Artificial Intelligence (AI) is transforming IDV, enhancing accuracy and fraud detection. This paper introduces ``Zero to One,'' a holistic conceptual framework for developing AI-powered IDV products. This paper outlines the foundational problem and research objectives that necessitate a new framework for IDV in the age of AI. It details the evolution of identity verification and the current regulatory landscape to contextualize the need for a robust conceptual model. The core of the paper is the presentation of the ``Zero to One'' framework itself, dissecting its four essential components: Document Verification, Biometric Verification, Risk Assessment, and Orchestration. The paper concludes by discussing the implications of this conceptual model and suggesting future research directions focused on the framework's further development and application. The framework addresses security, privacy, UX, and regulatory compliance, offering a structured approach to building effective IDV solutions. Successful IDV platforms require a balanced conceptual understanding of verification methods, risk management, and operational scalability, with AI as a key enabler. This paper presents the ``Zero to One'' framework as a refined conceptual model, detailing verification layers, and AI's transformative role in shaping next-generation IDV products.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
R+R: Security Vulnerability Dataset Quality Is Critical
Authors:
Anurag Swarnim Yadav,
Joseph N. Wilson
Abstract:
Large Language Models (LLMs) are of great interest in vulnerability detection and repair. The effectiveness of these models hinges on the quality of the datasets used for both training and evaluation. Our investigation reveals that a number of studies featured in prominent software engineering conferences have employed datasets that are plagued by high duplication rates, questionable label accurac…
▽ More
Large Language Models (LLMs) are of great interest in vulnerability detection and repair. The effectiveness of these models hinges on the quality of the datasets used for both training and evaluation. Our investigation reveals that a number of studies featured in prominent software engineering conferences have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples. Using these datasets for experimentation will yield incorrect results that are significantly different from actual expected behavior. For example, the state-of-the-art VulRepair Model, which is reported to have 44% accuracy, on average yielded 9% accuracy when test-set duplicates were removed from its training set and 13% accuracy when training-set duplicates were removed from its test set. In an effort to tackle these data quality concerns, we have retrained models from several papers without duplicates and conducted an accuracy assessment of labels for the top ten most hazardous Common Weakness Enumerations (CWEs). Our findings indicate that 56% of the samples had incorrect labels and 44% comprised incomplete samples--only 31% were both accurate and complete. Finally, we employ transfer learning using a large deduplicated bugfix corpus to show that these models can exhibit better performance if given larger amounts of high-quality pre-training data, leading us to conclude that while previous studies have over-estimated performance due to poor dataset quality, this does not demonstrate that better performance is not possible.
△ Less
Submitted 8 March, 2025;
originally announced March 2025.
-
What Are You Doing? A Closer Look at Controllable Human Video Generation
Authors:
Emanuele Bugliarello,
Anurag Arnab,
Roni Paiss,
Pieter-Jan Kindermans,
Cordelia Schmid
Abstract:
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video…
▽ More
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing `What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1{,}544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
Authors:
Xiaomin Li,
Zhou Yu,
Ziji Zhang,
Yingying Zhuang,
Swair Shah,
Narayanan Sadagopan,
Anurag Beniwal
Abstract:
Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or…
▽ More
Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or conflicting knowledge within the model. However, hallucinations can also stem from external uncertainty, where ambiguous user queries lead to multiple possible interpretations. In this work, we introduce Semantic Volume, a novel mathematical measure for quantifying both external and internal uncertainty in LLMs. Our approach perturbs queries and responses, embeds them in a semantic space, and computes the determinant of the Gram matrix of the embedding vectors, capturing their dispersion as a measure of uncertainty. Our framework provides a generalizable and unsupervised uncertainty detection method without requiring internal access to LLMs. We conduct extensive experiments on both external and internal uncertainty detection, demonstrating that our Semantic Volume method consistently outperforms existing baselines in both tasks. Additionally, we provide theoretical insights linking our measure to differential entropy, unifying and extending previous sampling-based uncertainty measures such as the semantic entropy. Semantic Volume is shown to be a robust and interpretable approach to improving the reliability of LLMs by systematically detecting uncertainty in both user queries and model responses.
△ Less
Submitted 5 May, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Generative Modeling of Individual Behavior at Scale
Authors:
Nabil Omi,
Lucas Caccia,
Anurag Sarkar,
Jordan T. Ash,
Siddhartha Sen
Abstract:
There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in doma…
▽ More
There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem -- where each task represents a distinct person -- and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared "skill" parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player's style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
A Study on Leveraging Search and Self-Feedback for Agent Reasoning
Authors:
Karthikeyan K,
Michelle Yuan,
Elman Mansimov,
Katerina Margatina,
Anurag Pratik,
Daniele Bonadiman,
Monica Sunkara,
Yi Zhang,
Yassine Benajiba
Abstract:
Recent works have demonstrated that incorporating search during inference can significantly improve reasoning capabilities of language agents. Some approaches may make use of the ground truth or rely on model's own generated feedback. The search algorithm uses this feedback to then produce values that will update its criterion for exploring and exploiting various reasoning paths. In this study, we…
▽ More
Recent works have demonstrated that incorporating search during inference can significantly improve reasoning capabilities of language agents. Some approaches may make use of the ground truth or rely on model's own generated feedback. The search algorithm uses this feedback to then produce values that will update its criterion for exploring and exploiting various reasoning paths. In this study, we investigate how search and model's self-feedback can be leveraged for reasoning tasks. First, we explore differences in ground-truth feedback and self-feedback during search for math reasoning. Second, we observe limitations in applying search techniques to more complex tasks like tool-calling and design domain-specific approaches to address these gaps. Our experiments reveal challenges related to generalization when solely relying on self-feedback during search. For search to work effectively, either access to the ground-truth is needed or feedback mechanisms need to be carefully designed for the specific task.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Bayesian Optimization for Building Social-Influence-Free Consensus
Authors:
Masaki Adachi,
Siu Lun Chau,
Wenjie Xu,
Anurag Singh,
Michael A. Osborne,
Krikamol Muandet
Abstract:
We introduce Social Bayesian Optimization (SBO), a vote-efficient algorithm for consensus-building in collective decision-making. In contrast to single-agent scenarios, collective decision-making encompasses group dynamics that may distort agents' preference feedback, thereby impeding their capacity to achieve a social-influence-free consensus -- the most preferable decision based on the aggregate…
▽ More
We introduce Social Bayesian Optimization (SBO), a vote-efficient algorithm for consensus-building in collective decision-making. In contrast to single-agent scenarios, collective decision-making encompasses group dynamics that may distort agents' preference feedback, thereby impeding their capacity to achieve a social-influence-free consensus -- the most preferable decision based on the aggregated agent utilities. We demonstrate that under mild rationality axioms, reaching social-influence-free consensus using noisy feedback alone is impossible. To address this, SBO employs a dual voting system: cheap but noisy public votes (e.g., show of hands in a meeting), and more accurate, though expensive, private votes (e.g., one-to-one interview). We model social influence using an unknown social graph and leverage the dual voting system to efficiently learn this graph. Our theoretical findigns show that social graph estimation converges faster than the black-box estimation of agents' utilities, allowing us to reduce reliance on costly private votes early in the process. This enables efficient consensus-building primarily through noisy public votes, which are debiased based on the estimated social graph to infer social-influence-free feedback. We validate the efficacy of SBO across multiple real-world applications, including thermal comfort, team building, travel negotiation, and energy trading collaboration.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
From Image to Video: An Empirical Study of Diffusion Representations
Authors:
Pedro Vélez,
Luisa F. Polanía,
Yi Yang,
Chuhan Zhang,
Rishabh Kabra,
Anurag Arnab,
Mehdi S. M. Sajjadi
Abstract:
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap…
▽ More
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
△ Less
Submitted 19 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Accelerating Data Processing and Benchmarking of AI Models for Pathology
Authors:
Andrew Zhang,
Guillaume Jaume,
Anurag Vaidya,
Tong Ding,
Faisal Mahmood
Abstract:
Advances in foundation modeling have reshaped computational pathology. However, the increasing number of available models and lack of standardized benchmarks make it increasingly complex to assess their strengths, limitations, and potential for further development. To address these challenges, we introduce a new suite of software tools for whole-slide image processing, foundation model benchmarkin…
▽ More
Advances in foundation modeling have reshaped computational pathology. However, the increasing number of available models and lack of standardized benchmarks make it increasingly complex to assess their strengths, limitations, and potential for further development. To address these challenges, we introduce a new suite of software tools for whole-slide image processing, foundation model benchmarking, and curated publicly available tasks. We anticipate that these resources will promote transparency, reproducibility, and continued progress in the field.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Faster Approximation Algorithms for k-Center via Data Reduction
Authors:
Arnold Filtser,
Shaofeng H. -C. Jiang,
Yi Li,
Anurag Murty Naredla,
Ioannis Psarros,
Qiaoyuan Yang,
Qin Zhang
Abstract:
We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering $α$-coreset, which is a small subset $S$ of the dataset $P$ such that any $β$-approximation on $S$ is an $(α+ β)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds…
▽ More
We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering $α$-coreset, which is a small subset $S$ of the dataset $P$ such that any $β$-approximation on $S$ is an $(α+ β)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds up existing approximation algorithms. Notably, we obtain a near-linear time $O(1)$-approximation when $k = n^c$ for any $0 < c < 1$. We validate the performance of our coresets on real-world datasets with large $k$, and we observe that the coreset speeds up the well-known Gonzalez algorithm by up to $4$ times, while still achieving similar clustering cost. Technically, one of our coreset results is based on a new efficient construction of consistent hashing with competitive parameters. This general tool may be of independent interest for algorithm design in high dimensional Euclidean spaces.
△ Less
Submitted 9 February, 2025;
originally announced February 2025.
-
End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings
Authors:
Yeruru Asrar Ahmed,
Anurag Mittal
Abstract:
Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities ( i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are…
▽ More
Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities ( i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO) reveal that having two separate embeddings gives better results than using a shared one and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach. Finally, we demonstrate that such learned embeddings can be used in other contexts as well, such as text-to-image manipulation.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
Authors:
Joanna Hong,
Sanjeel Parekh,
Honglie Chen,
Jacob Donley,
Ke Tan,
Buye Xu,
Anurag Kumar
Abstract:
Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain t…
▽ More
Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct uses of these multimodal solutions in real-world applications. In this work, we develop approaches where the learning happens with all available modalities but the deployment or inference is done with just one or reduced modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from missing modality using modalities present during inference. This innovative approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can reduce the performance gap between the multimodal and corresponding unimodal models to a considerable extent. MUTUD can achieve this while reducing the model size and compute compared to multimodal models, in some cases by almost 80%.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
Molecular-driven Foundation Model for Oncologic Pathology
Authors:
Anurag Vaidya,
Andrew Zhang,
Guillaume Jaume,
Andrew H. Song,
Tong Ding,
Sophia J. Wagner,
Ming Y. Lu,
Paul Doucet,
Harry Robertson,
Cristina Almagro-Perez,
Richard J. Chen,
Dina ElHarouni,
Georges Ayoub,
Connor Bossi,
Keith L. Ligon,
Georg Gerber,
Long Phi Le,
Faisal Mahmood
Abstract:
Foundation models are reshaping computational pathology by enabling transfer learning, where models pre-trained on vast datasets can be adapted for downstream diagnostic, prognostic, and therapeutic response tasks. Despite these advances, foundation models are still limited in their ability to encode the entire gigapixel whole-slide images without additional training and often lack complementary m…
▽ More
Foundation models are reshaping computational pathology by enabling transfer learning, where models pre-trained on vast datasets can be adapted for downstream diagnostic, prognostic, and therapeutic response tasks. Despite these advances, foundation models are still limited in their ability to encode the entire gigapixel whole-slide images without additional training and often lack complementary multimodal data. Here, we introduce Threads, a slide-level foundation model capable of generating universal representations of whole-slide images of any size. Threads was pre-trained using a multimodal learning approach on a diverse cohort of 47,171 hematoxylin and eosin (H&E)-stained tissue sections, paired with corresponding genomic and transcriptomic profiles - the largest such paired dataset to be used for foundation model development to date. This unique training paradigm enables Threads to capture the tissue's underlying molecular composition, yielding powerful representations applicable to a wide array of downstream tasks. In extensive benchmarking across 54 oncology tasks, including clinical subtyping, grading, mutation prediction, immunohistochemistry status determination, treatment response prediction, and survival prediction, Threads outperformed all baselines while demonstrating remarkable generalizability and label efficiency. It is particularly well suited for predicting rare events, further emphasizing its clinical utility. We intend to make the model publicly available for the broader community.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
Leveraging GANs For Active Appearance Models Optimized Model Fitting
Authors:
Anurag Awasthi
Abstract:
Active Appearance Models (AAMs) are a well-established technique for fitting deformable models to images, but they are limited by linear appearance assumptions and can struggle with complex variations. In this paper, we explore if the AAM fitting process can benefit from a Generative Adversarial Network (GAN). We uses a U-Net based generator and a PatchGAN discriminator for GAN-augmented framework…
▽ More
Active Appearance Models (AAMs) are a well-established technique for fitting deformable models to images, but they are limited by linear appearance assumptions and can struggle with complex variations. In this paper, we explore if the AAM fitting process can benefit from a Generative Adversarial Network (GAN). We uses a U-Net based generator and a PatchGAN discriminator for GAN-augmented framework in an attempt to refine the appearance model during fitting. This approach attempts to addresses challenges such as non-linear appearance variations and occlusions that traditional AAM optimization methods may fail to handle. Limited experiments on face alignment datasets demonstrate that the GAN-enhanced AAM can achieve higher accuracy and faster convergence than classic approaches with some manual interventions. These results establish feasibility of GANs as a tool for improving deformable model fitting in challenging conditions while maintaining efficient performance, and establishes the need for more future work to evaluate this approach at scale.
△ Less
Submitted 7 April, 2025; v1 submitted 19 January, 2025;
originally announced January 2025.
-
Covering half-grids with lines and planes
Authors:
Anurag Bishnoi,
Shantanu Nene
Abstract:
We study hyperplane covering problems for finite grid-like structures in $\mathbb{R}^d$. We call a set $\mathcal{C}$ of points in $\mathbb{R}^2$ a conical grid if the line $y = a_i$ intersects $\mathcal{C}$ in exactly $i$ points, for some $a_1 > \cdots > a_n \in \mathbb{R}$. We prove that the number of lines required to cover every point of such a grid at least $k$ times is at least…
▽ More
We study hyperplane covering problems for finite grid-like structures in $\mathbb{R}^d$. We call a set $\mathcal{C}$ of points in $\mathbb{R}^2$ a conical grid if the line $y = a_i$ intersects $\mathcal{C}$ in exactly $i$ points, for some $a_1 > \cdots > a_n \in \mathbb{R}$. We prove that the number of lines required to cover every point of such a grid at least $k$ times is at least $nk\left(1-\frac{1}{e}-O(\frac{1}{n}) \right)$. If the grid $\mathcal{C}$ is obtained by cutting an $m \times n$ grid of points into a half along one of the diagonals, then we prove the lower bound of $mk\left(1-e^{-\frac{n}{m}}-O(\frac{n}{m^2})\right)$.
Motivated by the Alon-Füredi theorem on hyperplane coverings of grids that miss a point and its multiplicity variations, we study the problem of finding the minimum number of hyperplanes required to cover every point of an $n \times \cdots \times n$ half-grid in $\mathbb{R}^d$ at least $k$ times while missing a point $P$. For almost all such half-grids, with $P$ being the corner point, we prove asymptotically sharp upper and lower bounds for the covering number in dimensions $2$ and $3$. For $k = 1$, $d = 2$, and an arbitrary $P$, we determine this number exactly by using the polynomial method bound for grids.
△ Less
Submitted 26 January, 2025; v1 submitted 19 January, 2025;
originally announced January 2025.
-
SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models
Authors:
Anurag Kumar,
Rohit Paturi,
Amber Afshan,
Sundararajan Srinivasan
Abstract:
Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models including fine-tuned large language models (LLMs) have shown to be effective as a second-pass speaker er…
▽ More
Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models including fine-tuned large language models (LLMs) have shown to be effective as a second-pass speaker error corrector by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations, while avoiding complicated post-processing. Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
Fast, Secure, Adaptable: LionsOS Design, Implementation and Performance
Authors:
Gernot Heiser,
Ivan Velickovic,
Peter Chubb,
Alwin Joshy,
Anuraag Ganesh,
Bill Nguyen,
Cheng Li,
Courtney Darville,
Guangtao Zhu,
James Archer,
Jingyao Zhou,
Krishnan Winter,
Lucy Parker,
Szymon Duchniewicz,
Tianyi Bai
Abstract:
We present LionsOS, an operating system for security- and safety-critical embedded systems. LionsOS is based on the formally verified seL4 microkernel and designed with verification in mind. It uses a static architecture and features a highly modular design driven by strict separa- tion of concerns and a focus on simplicity. We demonstrate that LionsOS achieves excellent performance on system-call…
▽ More
We present LionsOS, an operating system for security- and safety-critical embedded systems. LionsOS is based on the formally verified seL4 microkernel and designed with verification in mind. It uses a static architecture and features a highly modular design driven by strict separa- tion of concerns and a focus on simplicity. We demonstrate that LionsOS achieves excellent performance on system-call intensive workloads.
△ Less
Submitted 27 May, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts
Authors:
Shrestha Mohanty,
Sarah Xuan,
Jacob Jobraeel,
Anurag Kumar,
Deb Roy,
Jad Kabbara
Abstract:
We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of rel…
▽ More
We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of relevant issues, by highlighting perspectives and experiences to other groups of people who might otherwise be unfamiliar with and thus unable to relate to these experiences. The primary challenge that arises then is that excerpts taken from one conversation and shared in another setting might be missing crucial context or key elements that were previously introduced in the original conversation. This problem is exacerbated when conversations become lengthier and richer in themes and shared experiences. To address this, we explore how Large Language Models (LLMs) can enrich these excerpts by providing socially relevant context. We present approaches for effective contextualization to improve comprehension, readability, and empathy. We show significant improvements in understanding, as assessed through subjective and objective evaluations. While LLMs can offer valuable context, they struggle with capturing key social aspects. We release the Human-annotated Salient Excerpts (HSE) dataset to support future work. Additionally, we show how context-enriched excerpts can provide more focused and comprehensive conversation summaries.
△ Less
Submitted 27 December, 2024;
originally announced December 2024.
-
Advancing Explainability in Neural Machine Translation: Analytical Metrics for Attention and Alignment Consistency
Authors:
Anurag Mishra
Abstract:
Neural Machine Translation (NMT) models have shown remarkable performance but remain largely opaque in their decision making processes. The interpretability of these models, especially their internal attention mechanisms, is critical for building trust and verifying that these systems behave as intended. In this work, we introduce a systematic framework to quantitatively evaluate the explainabilit…
▽ More
Neural Machine Translation (NMT) models have shown remarkable performance but remain largely opaque in their decision making processes. The interpretability of these models, especially their internal attention mechanisms, is critical for building trust and verifying that these systems behave as intended. In this work, we introduce a systematic framework to quantitatively evaluate the explainability of an NMT model attention patterns by comparing them against statistical alignments and correlating them with standard machine translation quality metrics. We present a set of metrics attention entropy and alignment agreement and validate them on an English-German test subset from WMT14 using a pre trained mT5 model. Our results indicate that sharper attention distributions correlate with improved interpretability but do not always guarantee better translation quality. These findings advance our understanding of NMT explainability and guide future efforts toward building more transparent and reliable machine translation systems.
△ Less
Submitted 24 December, 2024;
originally announced December 2024.
-
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Authors:
Haohe Liu,
Gael Le Lan,
Xinhao Mei,
Zhaoheng Ni,
Anurag Kumar,
Varun Nagaraja,
Wenwu Wang,
Mark D. Plumbley,
Yangyang Shi,
Vikas Chandra
Abstract:
Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses d…
▽ More
Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and audio learning before joint fine-tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than baseline methods with significantly enhanced audio quality and audio-visual correspondence. Moreover, we demonstrate strong zero-shot capabilities of SyncFlow, including zero-shot video-to-audio generation and adaptation to novel video resolutions without further training.
△ Less
Submitted 3 December, 2024;
originally announced December 2024.
-
SMEVCA: Stable Matching-based EV Charging Assignment in Subscription-Based Models
Authors:
Arindam Khanda,
Anurag Satpathy,
Anusha Vangala,
Sajal K. Das
Abstract:
The rapid shift from internal combustion engine vehicles to battery-powered electric vehicles (EVs) presents considerable challenges, such as limited charging points (CPs), unpredictable wait times, and difficulty selecting appropriate CPs. To address these challenges, we propose a novel end-to-end framework called Stable Matching EV Charging Assignment (SMEVCA) that efficiently assigns charge-see…
▽ More
The rapid shift from internal combustion engine vehicles to battery-powered electric vehicles (EVs) presents considerable challenges, such as limited charging points (CPs), unpredictable wait times, and difficulty selecting appropriate CPs. To address these challenges, we propose a novel end-to-end framework called Stable Matching EV Charging Assignment (SMEVCA) that efficiently assigns charge-seeking EVs to CPs with assistance from roadside units (RSUs). The proposed framework operates within a subscription-based model, ensuring that the subscribed EVs complete their charging within a predefined time limit enforced by a service level agreement (SLA). The framework SMEVCA employs a stable, fast, and efficient EV-CP assignment formulated as a one-to-many matching game with preferences. The matching process identifies the preferred coalition (a subset of EVs assigned to the CPs) using two strategies: (1) Preferred Coalition Greedy (PCG) that offers an efficient, locally optimal heuristic solution and (2) Preferred Coalition Dynamic (PCD) that is more computation-intensive but delivers a globally optimal coalition. Extensive simulations reveal that PCG and PCD achieve a gain of 14.6% and 20.8% over random elimination for in-network charge transferred with only 3% and 0.1% EVs unserved within the RSUs vicinity.
△ Less
Submitted 13 December, 2024;
originally announced December 2024.
-
DISHONEST: Dissecting misInformation Spread using Homogeneous sOcial NEtworks and Semantic Topic classification
Authors:
Caleb Stam,
Emily Saldanha,
Mahantesh Halappanavar,
Anurag Acharya
Abstract:
The emergence of the COVID-19 pandemic resulted in a significant rise in the spread of misinformation on online platforms such as Twitter. Oftentimes this growth is blamed on the idea of the "echo chamber." However, the behavior said to characterize these echo chambers exists in two dimensions. The first is in a user's social interactions, where they are said to stick with the same clique of like-…
▽ More
The emergence of the COVID-19 pandemic resulted in a significant rise in the spread of misinformation on online platforms such as Twitter. Oftentimes this growth is blamed on the idea of the "echo chamber." However, the behavior said to characterize these echo chambers exists in two dimensions. The first is in a user's social interactions, where they are said to stick with the same clique of like-minded users. The second is in the content of their posts, where they are said to repeatedly espouse homogeneous ideas. In this study, we link the two by using Twitter's network of retweets to study social interactions and topic modeling to study tweet content. In order to measure the diversity of a user's interactions over time, we develop a novel metric to track the speed at which they travel through the social network. The application of these analysis methods to misinformation-focused data from the pandemic demonstrates correlation between social behavior and tweet content. We believe this correlation supports the common intuition about how antisocial users behave, and further suggests that it holds even in subcommunities already rife with misinformation.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback
Authors:
Avinash Anand,
Kritarth Prasad,
Chhavi Kirtani,
Ashwin R Nair,
Mohit Gupta,
Saloni Garg,
Anurag Gautam,
Snehal Buldeo,
Rajiv Ratn Shah
Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems, particularly in advanced arithmetic and conceptual understanding. While some research has explored ways to enhance LLMs in physics education using techniques such as prompt engineering and Retrieval Augmentation Generation (RAG), not enough e…
▽ More
Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems, particularly in advanced arithmetic and conceptual understanding. While some research has explored ways to enhance LLMs in physics education using techniques such as prompt engineering and Retrieval Augmentation Generation (RAG), not enough effort has been made in addressing their limitations in physics reasoning. This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF). We evaluate several reinforcement learning methods, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Remax optimization. These methods are chosen to investigate RL policy performance with different settings on the PhyQA dataset, which includes challenging physics problems from high school textbooks. Our RLHAIF model, tested on leading LLMs like LLaMA2 and Mistral, achieved superior results, notably with the MISTRAL-PPO model, demonstrating marked improvements in reasoning and accuracy. It achieved high scores, with a 58.67 METEOR score and a 0.74 Reasoning score, making it a strong example for future physics reasoning research in this area.
△ Less
Submitted 6 December, 2024;
originally announced December 2024.
-
Institutional Shifts in Contribution to Indian Research Output during the last two decades
Authors:
Vivek Kumar Singh,
Mousumi Karmakar,
Anurag Kanaujia
Abstract:
In the past few decades, India has emerged as a major knowledge producer, with research output being contributed by a diverse set of institutions ranging from centrally funded to state funded, and from public funded to private funded institutions. A significant change has been witnessed in Indian institutional actors during the last two decades, with various new private universities being set up a…
▽ More
In the past few decades, India has emerged as a major knowledge producer, with research output being contributed by a diverse set of institutions ranging from centrally funded to state funded, and from public funded to private funded institutions. A significant change has been witnessed in Indian institutional actors during the last two decades, with various new private universities being set up and several new IITs, NITs, IISERs being established. Therefore, it is important to identify whether the composition of the list of the top 100 research output producing institutions of India has changed significantly during the recent two decades. This study attempted to analyse the changes during the two 10-year periods (2004-13 and 2014-23). The institutions which retain their position within top 100 during both periods are identified, along with the change in their positions. Similarly, institutions that were there in top 100 list during first time period (2004-13) and go out of top 100 list during second time period (2014-23) are also identified. In the same line, the new entrant institutions in the top 100 list during second time period (2014-23) are identified too. The results obtained indicate towards an institutional shift in the contribution to Indian research output.
△ Less
Submitted 9 December, 2024;
originally announced December 2024.
-
INRFlow: Flow Matching for INRs in Ambient Space
Authors:
Yuyang Wang,
Anurag Ranjan,
Josh Susskind,
Miguel Angel Bautista
Abstract:
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or even protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the dat…
▽ More
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or even protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the data compressor. This two-stage paradigm sets obstacles for unifying models across data domains, as hand-crafted compressors architectures are used for different data modalities. To this end, we introduce INRFlow, a domain-agnostic approach to learn flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point-wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds and protein structure data, achieving strong performance in different domains and outperforming comparable approaches. INRFlow is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.
△ Less
Submitted 28 May, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
The Reality of AI and Biorisk
Authors:
Aidan Peppin,
Anka Reuel,
Stephen Casper,
Elliot Jones,
Andrew Strait,
Usman Anwar,
Anurag Agrawal,
Sayash Kapoor,
Sanmi Koyejo,
Marie Pellat,
Rishi Bommasani,
Nick Frosst,
Sara Hooker
Abstract:
To accurately and confidently answer the question 'could an AI model or system increase biorisk', it is necessary to have both a sound theoretical threat model for how AI models or systems could increase biorisk and a robust method for testing that threat model. This paper provides an analysis of existing available research surrounding two AI and biorisk threat models: 1) access to information and…
▽ More
To accurately and confidently answer the question 'could an AI model or system increase biorisk', it is necessary to have both a sound theoretical threat model for how AI models or systems could increase biorisk and a robust method for testing that threat model. This paper provides an analysis of existing available research surrounding two AI and biorisk threat models: 1) access to information and planning via large language models (LLMs), and 2) the use of AI-enabled biological tools (BTs) in synthesizing novel biological artifacts. We find that existing studies around AI-related biorisk are nascent, often speculative in nature, or limited in terms of their methodological maturity and transparency. The available literature suggests that current LLMs and BTs do not pose an immediate risk, and more work is needed to develop rigorous approaches to understanding how future models could increase biorisks. We end with recommendations about how empirical work can be expanded to more precisely target biorisk and ensure rigor and validity of findings.
△ Less
Submitted 2 January, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.