Search | arXiv e-print repository

EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Authors: Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao

Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.

Submitted 26 June, 2025; originally announced June 2025.

Comments: Accepted at ICCV 2025

arXiv:2506.11946 [pdf, ps, other]

A visco-plastic constitutive model for accurate densification and shape predictions in powder metallurgy hot isostatic pressing

Authors: Subrato Sarkar, Jason R Mayeur, KPK Ajjarapu, Fred A List III, Soumya Nag, Ryan R Dehoff

Abstract: Powder metallurgy hot isostatic pressing (PM-HIP) is an advanced manufacturing process that produces near-net-shape parts with high material utilization and uniform microstructures. Despite being used frequently to produce small-scale components, the application of PM-HIP to large-scale components is limited due to inadequate understanding of its complex mechanisms that cause unpredictable post-HIP shape distortions. A computational model can provide necessary information about the intermediate and final stages of the HIP process that can help understand it better and make accurate predictions. Generally, two types of computational models are employed for PM-HIP simulations, namely, plastic and visco-plastic models. Between these, the plastic model is preferred due to its cheaper calibration approach requiring less experimental data. However, the plastic model sometimes produces incorrect predictions when slight variations of the HIP conditions are encountered in practical situations. Therefore, this work presents a visco-plastic model that addresses these limitations of the plastic model. A novel modified calibration approach is employed for the visco-plastic model that utilizes less experimental data than existing approaches. With the new approach, the data requirement is the same for both plastic and visco-plastic models. This also enables a quantitative comparison of plastic and visco-plastic models, which have been only qualitatively compared in the past. When calibrated with the same experimental data, both models are found to produce similar results. The calibrated visco-plastic model is applied to several complex geometries, and the predictions are found to be in good agreement with experimental observations.

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.07016 [pdf, ps, other]

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Authors: Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha

Abstract: Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework, MAGNET, to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores on the QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/

Submitted 13 June, 2025; v1 submitted 8 June, 2025; originally announced June 2025.

Comments: Audio-visual learning, Audio-Visual RAG, Multi-Video Linkage

arXiv:2505.23907 [pdf, ps, other]

Cora: Correspondence-aware image editing using few step diffusion

Authors: Amirhossein Almohammadi, Aryan Mikaeili, Sauradip Nag, Negar Hassanpour, Andrea Tagliasacchi, Ali Mahdavi-Amiri

Abstract: Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few-step editing approaches produce artifacts such as irrelevant texture or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.

Submitted 29 May, 2025; originally announced May 2025.

Comments: Published in SIGGRAPH 2025

ACM Class: I.4.10; I.3.7; I.2.10

arXiv:2505.20737 [pdf, ps, other]

RRO: LLM Agent Optimization Through Rising Reward Trajectories

Authors: Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Muhammad Sarwar

Abstract: Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks, yet it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, also known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates since they require extensive computations to acquire the training data through per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e., rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.

Submitted 27 May, 2025; originally announced May 2025.

Comments: preprint

arXiv:2505.18832 [pdf, other]

Localizing Knowledge in Diffusion Transformers

Authors: Arman Zarei, Samyadeep Basu, Keivan Rezaei, Zihao Lin, Sayan Nag, Soheil Feizi

Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

Submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.15196 [pdf, ps, other]

EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association

Authors: Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, Yangqiu Song

Abstract: Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.

Submitted 21 May, 2025; originally announced May 2025.

Comments: ACL2025

arXiv:2505.10237 [pdf, ps, other]

Competition between the neutron-proton pair break-ups delineating the level structure of 202Po

Authors: Sahab Singh, D. Choudhury, B. Maheshwari, R. Roy, K. Yadav, R. Palit, B. Das, P. Dey, A. Kundu, Md. S. R. Laskar, D. Negi, V. Malik, S. Jadhav, B. S. Naidu, A. V. Thomas, D. L. Balabanski, A. Dhal, S. Bhattacharya, A. K. Singh, S. Bhattacharyya, S. Nag

Abstract: A high-spin spectroscopic study of $^{202}$Po ($Z$ = 84, $N$ = 118) has been carried out using the $^{195}$Pt($^{12}$C, 5n)$^{202}$Po fusion-evaporation reaction. An extended level scheme has been proposed up to an excitation energy of $E_x\approx$ 8 MeV and angular momentum of 27$\hbar$, with the addition of 57 newly observed $\gamma$-ray transitions, along with revisions in the placement of 8 already known transitions and the multipolarities of 4 of these transitions. The energy of the unobserved 8$^+ \rightarrow 6^+$ transition has been proposed to be 9.0(5) keV, which resolves the uncertainty in the excitation energy of the levels above the 6$^{+}$ state. Three new sequences of $M1$ transitions have also been identified in the high excitation energy regime and included in the proposed level scheme. Large-scale shell model calculations for the $Z>82$ and $N<126$ valence space have been carried out using the PBPOP interaction, which explained the overall level scheme for both the positive and negative parity states. The calculations successfully reproduced the purity of the proton $\pi h_{9/2}$ dominated $8^+$ isomeric state, and also explained the missing $E2$ decay of the ${12}^+$ isomeric state in terms of changing nucleonic configurations.

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2504.10724 [pdf, other]

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das

Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when they confidently find an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown a priori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to the changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them using a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings, which enable us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can service incoming queries more efficiently (for example, by using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48$\times$ throughput, 1.10$\times$ energy-efficiency, 1.39$\times$ lower response time, and 3.7$\times$ improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.09723 [pdf, other]

AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

Authors: Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang

Abstract: A/B testing is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on large-scale, live traffic from human participants and by the long wait for test results. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B test with 1,000 LLM agents on Amazon.com, and compare agent behaviors with real human shopping behaviors at scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.

Submitted 21 April, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

arXiv:2504.08366 [pdf, ps, other]

In-2-4D: Inbetweening from Two Single-View Images to 4D Generation

Authors: Sauradip Nag, Daniel Cohen-Or, Hao Zhang, Ali Mahdavi-Amiri

Abstract: We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D. We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that are visually close to the input states and show significant motion, then generate smooth fragments between them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The temporal frames within the fragment guide the motion, enabling their transformation into dynamic Gaussians through a deformation field. To improve temporal consistency and refine 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions. Through extensive qualitative and quantitative experiments as well as a user study, we show the effectiveness of our method and its components. The project page is available at https://in-2-4d.github.io/

Submitted 11 April, 2025; originally announced April 2025.

Comments: Technical Report

arXiv:2503.23219 [pdf, other]

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

Authors: Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha

Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.15742 [pdf, other]

Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

Authors: Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury

Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single-image-to-3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single-image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with those of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.

Submitted 19 March, 2025; originally announced March 2025.

Comments: 13 pages, 7 figures

arXiv:2503.13947 [pdf, other]

Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Authors: Sayak Nag, Udita Ghosh, Calvin-Khang Ta, Sarosij Bose, Jiachen Li, Amit K Roy Chowdhury

Abstract: Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptable to any existing SGG method, for quantifying its predictive uncertainty by constructing well-calibrated prediction sets over its generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.

Submitted 10 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

Comments: Accepted at CVPR 2025

arXiv:2502.12173 [pdf, other]

nanoML for Human Activity Recognition

Authors: Alan T. L. Bacellar, Mugdha P. Jadhao, Shashank Nag, Priscila M. V. Lima, Felipe M. G. Franca, Lizy K. John

Abstract: Human Activity Recognition (HAR) is critical for applications in healthcare, fitness, and IoT, but deploying accurate models on resource-constrained devices remains challenging due to high energy and memory demands. This paper demonstrates the application of Differentiable Weightless Neural Networks (DWNs) to HAR, achieving competitive accuracies of 96.34% and 96.67% while consuming only 56 nJ and 104 nJ per sample, with an inference time of just 5 ns per sample. The DWNs were implemented and evaluated on an FPGA, showcasing their practical feasibility for energy-efficient hardware deployment. DWNs achieve up to 926,000x energy savings and 260x memory reduction compared to state-of-the-art deep learning methods. These results position DWNs as a nano-machine learning (nanoML) model for HAR, setting a new benchmark in energy efficiency and compactness for edge and wearable devices, paving the way for ultra-efficient edge AI.

Submitted 13 February, 2025; originally announced February 2025.

Comments: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

arXiv:2502.07278 [pdf, other]

Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization

Authors: Aditya Vora, Sauradip Nag, Hao Zhang

Abstract: We present ATOP (Articulate That Object Part), a novel few-shot method based on motion personalization to articulate a static 3D object with respect to a part and its motion as prescribed in a text prompt. Given the scarcity of available datasets with motion attribute annotations, existing methods struggle to generalize well on this task. In our work, the text input allows us to tap into the power of modern-day diffusion models to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides image prompting to personalize the generated video to the very object we wish to articulate. Our method starts with a few-shot finetuning for category-specific motion generation, a key first step to compensate for the lack of articulation awareness in current diffusion models. For this, we finetune a pre-trained multi-view image generation model for controllable multi-view video generation, using a small collection of video samples obtained for the target object category. This is followed by motion video personalization that is realized by multi-view rendered images of the target 3D object. Finally, we transfer the personalized video motion to the target 3D object via differentiable rendering to optimize part motion parameters by a score distillation sampling loss. Experimental results on PartNet-Sapien and ACD datasets show that our method is capable of generating realistic motion videos and predicting 3D motion parameters in a more accurate and generalizable way, compared to prior works in the few-shot setting.

Submitted 13 March, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: Technical Report, 16 pages

arXiv:2502.01555 [pdf, other]

Query Brand Entity Linking in E-Commerce Search

Authors: Dong Liu, Sreyashi Nag

Abstract: In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification. We validate our solutions with both offline benchmarks and an online A/B test.

Submitted 3 February, 2025; originally announced February 2025.

arXiv:2501.07845 [pdf, other]

Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning

Authors: Haoyu Han, Yaochen Xie, Hui Liu, Xianfeng Tang, Sreyashi Nag, William Headden, Hui Liu, Yang Li, Chen Luo, Shuiwang Ji, Qi He, Jiliang Tang

Abstract: Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks; however, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between distinct pieces of information within text sequences. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering, where understanding implicit relationships between entities and leveraging multi-hop connections in the given context are crucial. Graphs, as fundamental data structures, explicitly represent pairwise relationships between entities, thereby offering the potential to enhance LLMs' reasoning capabilities. External graphs have proven effective in supporting LLMs across multiple tasks. However, in many reasoning tasks, no pre-existing graph structure is provided. Can we structure implicit knowledge derived from context into graphs to assist LLMs in reasoning? In this paper, we propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context and then leveraging these graphs to enhance LLM reasoning performance on reasoning tasks. Extensive experiments demonstrate the effectiveness of the proposed method in improving both logical reasoning and multi-hop question answering tasks.

Submitted 14 January, 2025; originally announced January 2025.

arXiv:2501.02135 [pdf, other]

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization based training strategy CAVPref, obtaining a gain up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.

Submitted 3 January, 2025; originally announced January 2025.

arXiv:2501.01039 [pdf, other]

MSWA: Refining Local Attention with Multi-Scale Window Attention

Authors: Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum

Abstract: Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.

Submitted 1 January, 2025; originally announced January 2025.
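
The difference between a uniform sliding window and per-head window sizes, as described in the abstract above, can be sketched with boolean attention masks. This is an illustrative sketch only: the helper names `sliding_window_mask` and `multi_scale_masks` and the window sizes are hypothetical, and the paper's actual allocation of windows across heads and layers is not given in the abstract.

```python
def sliding_window_mask(seq_len, window):
    """Causal local mask: mask[i][j] is True when query i may attend to key j.

    A query sees itself and the (window - 1) positions immediately before it.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def multi_scale_masks(seq_len, head_windows):
    """One mask per attention head, each with its own window size (assumed values)."""
    return [sliding_window_mask(seq_len, w) for w in head_windows]

# Toy usage: three heads with windows of 1, 2, and 4 tokens over a 4-token sequence.
masks = multi_scale_masks(4, [1, 2, 4])
for head, mask in enumerate(masks):
    print(head, mask[3])  # which keys the last query can see in each head
```

Under uniform SWA, every head would receive the same mask; here the small-window heads focus on immediate neighbours while the large-window head covers the full causal context. The same helper could be called with larger windows for deeper layers.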

arXiv:2412.21198 [pdf, other]

Dynamic magnetic response in ABA type trilayered systems and compensation phenomenon

Authors: Enakshi Guru, Sonali Saha, Sankhasubhra Nag

Abstract: Dynamic magnetic response in a trilayered structure with non-equivalent layers (ABA type) has been studied with Monte Carlo simulation using the Metropolis algorithm. In each layer, ferromagnetic (FM) nearest neighbour Ising interactions are present along with antiferromagnetic (AFM) nearest neighbour coupling across different layers. The system is studied under a harmonically oscillating external magnetic field. It is revealed that along with dynamic phase transition (DPT), a compensation phenomenon emerges in this system in the dynamic scenario too. This feature in the dynamic case is unique to such trilayered systems, in contrast to the bulk system reported earlier. The temporal behaviour of the magnetisation of each individual layer shows that the different magnetic responses of the non-equivalent layers result in this dynamic compensation phenomenon. The difference in response also results in warping of the dynamic hysteresis loops, under various external parameter values, such as amplitude of the oscillating field and temperature.

Submitted 30 December, 2024; originally announced December 2024.

Comments: 18 pages, 13 figures

MSC Class: 82C26

arXiv:2411.08028 [pdf, other]

Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

Authors: Juanhui Li, Sreyashi Nag, Hui Liu, Xianfeng Tang, Sheikh Sarwar, Limeng Cui, Hansu Gu, Suhang Wang, Qi He, Jiliang Tang

Abstract: In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available, which can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD that enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.

Submitted 30 March, 2025; v1 submitted 12 November, 2024; originally announced November 2024.
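
The selection rule described in the abstract above (prefer samples where the teacher is confident and the student is uncertain) can be sketched as follows. This is a hedged illustration, not the LLKD method itself: the function names are hypothetical, and scoring each sample by teacher max-probability times student entropy is an assumption made here because the abstract does not state the formula.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; used here as a proxy for the student's information need."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_samples(teacher_probs, student_probs, k):
    """Return indices of the k samples with the highest combined score.

    Score = teacher confidence (max class probability) * student entropy.
    The product is an assumed combination, chosen only for illustration.
    """
    scored = []
    for idx, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        teacher_conf = max(t)      # high value suggests a reliable pseudo-label
        student_need = entropy(s)  # high value suggests the student is unsure
        scored.append((teacher_conf * student_need, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy usage with made-up probabilities for three unlabeled samples.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]]
student = [[0.50, 0.50], [0.50, 0.50], [0.99, 0.01]]
print(select_samples(teacher, student, k=2))
```

Sample 0 ranks first because the teacher is confident and the student is maximally unsure; sample 2 ranks last because the student already agrees with high confidence, so training on it would add little.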

arXiv:2411.01818 [pdf, other]

Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference

Authors: Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John

Abstract: Transformers are set to become ubiquitous with applications ranging from chatbots and educational assistants to visual recognition and remote sensing. However, their increasing computational and memory demands are resulting in growing energy consumption. Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications. Look Up Table (LUT) based Weightless Neural Networks are faster than conventional neural networks as their inference only involves a few lookup operations. Recently, an approach for learning LUT networks directly via an Extended Finite Difference method was proposed. We build on this idea, extending it for performing the functions of the Multi Layer Perceptron (MLP) layers in transformer models and integrating them with transformers to propose Quasi Weightless Transformers (QuWeiT). This allows for a computational and energy-efficient inference solution for transformer-based models. On I-ViT-T, we achieve a comparable accuracy of 95.64% on the CIFAR-10 dataset while replacing approximately 55% of all the multiplications in the entire model and achieving a 2.2x energy efficiency. We also observe similar savings on experiments with the nanoGPT framework.

Submitted 4 November, 2024; originally announced November 2024.

arXiv:2410.18538 [pdf, other]

SMITE: Segment Me In TimE

Authors: Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

Abstract: Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text to image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.

Submitted 18 February, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

Comments: ICLR 2025; Project page is at https://segment-me-in-time.github.io/

arXiv:2410.17952 [pdf, other]

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Authors: Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2% to 8.6%.

Submitted 24 January, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

Comments: Accepted to NAACL 2025 main conference

Journal ref: NAACL 2025

arXiv:2410.14220 [pdf, other]

Comparative Performance Analysis of Crystals in Total-Body PET Scanners: Monte-Carlo Simulation Study with Different Materials and Geometry

Authors: D. Choudhary, S. Nag

Abstract: Total-Body PET (TB-PET) scanners represent a significant advancement in medical diagnostics, exemplified by the uEXPLORER, the world's first TB-PET system with an axial span of 194 cm, which exhibits exceptional sensitivity and spatial resolution. This study employs the Monte Carlo simulation toolkit Geant4 to evaluate various configurations and materials of detector crystals. We concentrate on three critical parameters: sensitivity, intrinsic coincidence time resolution (CTR), and energy resolution across three crystal designs: 1) standard LYSO crystals as the baseline; 2) 0.1% Mg, 1% Ce doped $Gd_3Al_2Ga_3O_{12}$ (Mg,Ce:GAGG) as an alternative material; and 3) pyramid-shaped LYSO crystals, which maintain the same dimensions as the standard LYSO. The research is grounded in the geometric configuration of the uEXPLORER. Our findings reveal that pyramid-shaped LYSO crystals exhibit superior performance, achieving an impressive CTR of 42 ps. In contrast, PET detectors utilizing doped GAGG crystals demonstrate a 6% reduction in intrinsic CTR compared to LYSO. However, Mg,Ce:GAGG crystals surpass LYSO in energy resolution by 25%, while cuboidal LYSO crystals achieve approximately 37% greater sensitivity than their Mg,Ce:GAGG counterparts. These results underscore the impact of different crystal materials and geometries on PET scanner performance, emphasizing the trade-offs among sensitivity, coincidence time resolution (CTR), and energy resolution. Such insights are pivotal for informing the design of future TB-PET systems.

Submitted 9 April, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

Comments: 12 pages, 13 total figures, 7 captioned figures

arXiv:2409.04748 [pdf, other]

doi 10.1021/acs.jctc.4c00856

Dissipative self-assembly of patchy particles under nonequilibrium drive: a computational study

Authors: Shubhadeep Nag, Gili Bisker

Abstract: Inspired by biology and implemented using nanotechnology, the self-assembly of patchy particles has emerged as a pivotal mechanism for constructing complex structures that mimic natural systems with diverse functionalities. Here, we explore the dissipative self-assembly of patchy particles under nonequilibrium conditions, with the aim of overcoming the constraints imposed by equilibrium assembly. Utilizing extensive Monte Carlo (MC) and Molecular Dynamics (MD) simulations, we provide insight into the effects of external forces that mirror natural and chemical processes on the assembly rates and the stability of the resulting assemblies comprising 8, 10, and 13 patchy particles. Implemented by a favorable bond-promoting drive in MC or a pulsed square wave potential in MD, our simulations reveal the role these external drives play in accelerating assembly kinetics and enhancing structural stability, evidenced by a decrease in the time to first assembly and an increase in the duration the system remains in an assembled state. Through the analysis of an order parameter, entropy production, bond dynamics, and interparticle forces, we unravel the underlying mechanisms driving these advancements. We also validated our key findings by simulating a larger system of 100 patchy particles. Our comprehensive results not only shed light on the impact of external stimuli on self-assembly processes but also open a promising pathway for expanding the application by leveraging patchy particles for novel nanostructures.

Submitted 7 September, 2024; originally announced September 2024.

Comments: 74 pages and 21 figures

arXiv:2409.00975 [pdf, other]

Deciphering Interstellar Ice Morphology: Atomistic Simulations Reveal the Complex Behavior of Ethanethiol

Authors: Jeet Majumdar, Shubhadeep Nag, Tejender S Thakur, Subramanian Yashonath, Bhalamurugan Sivaraman, Prabal K. Maiti

Abstract: Ethanethiol (C$_2$H$_5$SH), a molecule detected in the interstellar medium (ISM), indicates the rich chemistry involving sulfur atoms. However, its behavior at low temperatures remains elusive, particularly the reported transition from an amorphous phase to a crystal. This study employs classical molecular dynamics (MD) simulations to reproduce the liquid-state properties of ethanethiol and to simulate the initial amorphous state of ethanethiol films deposited on a KBr substrate. The amorphous ethanethiol did not show spontaneous crystallization upon increasing temperature. Also, ethanethiol ice crystals exhibit melting behavior on the KBr substrate at elevated temperatures. Our MD simulations of thin ice samples do not show any signature of a reversible phase change. It will be interesting to continue this study with a thicker sample, which is beyond our current computational means. These findings underscore the complexity of icy mantle morphology on cold ISM dust grains.

Submitted 2 September, 2024; originally announced September 2024.

Comments: Manuscript accepted for publication in "Astrophysics and Space Science Proceedings", Springer nature (Symposium: ISRA 2023)

arXiv:2408.02215 [pdf]

Exploring Query Understanding for Amazon Product Search

Authors: Chen Luo, Xianfeng Tang, Hanqing Lu, Yaochen Xie, Hui Liu, Zhenwei Dai, Limeng Cui, Ashutosh Joshi, Sreyashi Nag, Yang Li, Zhen Li, Rahul Goutam, Jiliang Tang, Haiyang Zhang, Qi He

Abstract: Online shopping platforms, such as Amazon, offer services to billions of people worldwide. Unlike web search or other search engines, product search engines have their unique characteristics, primarily featuring short queries which are mostly a combination of product attributes and structured product search space. The uniqueness of product search underscores the crucial importance of the query understanding component. However, there are limited studies focusing on exploring this impact within real-world product search engines. In this work, we aim to bridge this gap by conducting a comprehensive study and sharing our year-long journey investigating how the query understanding service impacts Amazon Product Search. Firstly, we explore how query understanding-based ranking features influence the ranking process. Next, we delve into how the query understanding system contributes to understanding the performance of a ranking model. Building on the insights gained from our study on the evaluation of the query understanding-based ranking model, we propose a query understanding-based multi-task learning framework for ranking. We present our studies and investigations using the real-world system on Amazon Search.

Submitted 4 August, 2024; originally announced August 2024.

arXiv:2408.01690 [pdf, other]

IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

Authors: Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou

Abstract: Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.

Submitted 3 September, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

Comments: 40 pages

arXiv:2407.19392 [pdf, other]

AndroCon: Conning Location Services in Android

Authors: Soham Nag, Smruti R. Sarangi

Abstract: Mobile device hackers often target ambient sensing, human activity identification, and interior floor mapping. In addition to overt signals like microphones and cameras, covert channels like WiFi, Bluetooth, and augmented GPS signal strengths have been employed to gather this information. To date, passive, receive-only satellite GPS sensing has relied solely on signal strength and location information. This paper demonstrates that semi-processed GPS data (39 features), accessible to apps since Android 7 with precise location permissions, can be used as a highly accurate leaky channel for ambient sensing, recognising human activity, and mapping indoor spaces (99%+ accuracy). This report describes a longitudinal study that used semi-processed GPS readings from mobile devices throughout a 40,000 sq. km region for a year. Data was acquired from aeroplanes, cruise ships, and high-altitude places. To retain crucial information, we analyse all satellite GPS signals and select the best characteristics using cross-correlation analysis. Our work, AndroCon, combines linear discriminant analysis, unscented Kalman filtering, gradient boosting, and random forest learning to provide an accurate ambient and human activity sensor. In AndroCon, basic ML algorithms are used for discreet and somewhat explainable outcomes. We can readily recognise challenging situations, such as being in a subway, when someone is waving a hand in front of a mobile device, in front of a stairway, or with others present (not always carrying phones). This is the most extensive study on satellite GPS-based sensing as of yet.

Submitted 28 July, 2024; originally announced July 2024.

Comments: 18 pages

arXiv:2407.18553 [pdf, other]

REAPER: Reasoning based Retrieval Planning for Complex RAG Systems

Authors: Ashutosh Joshi, Sheikh Muhammad Sarwar, Samarth Varshney, Sreyashi Nag, Shrivats Agrawal, Juhi Naik

Abstract: Complex dialog systems often use retrieved evidence to facilitate factual responses. Such RAG (Retrieval Augmented Generation) systems retrieve from massive heterogeneous data stores that are usually architected as multiple indexes or APIs instead of a single monolithic source. For a given query, relevant evidence needs to be retrieved from one or a small subset of possible retrieval sources. Complex queries can even require multi-step retrieval. For example, a conversational agent on a retail site answering customer questions about past orders will need to retrieve the appropriate customer order first and then the evidence relevant to the customer's question in the context of the ordered product. Most RAG Agents handle such Chain-of-Thought (CoT) tasks by interleaving reasoning and retrieval steps. However, each reasoning step directly adds to the latency of the system. For large models this latency cost is significant, on the order of multiple seconds. Multi-agent systems may classify the query to a single Agent associated with a retrieval source, though this means that a (small) classification model dictates the performance of a large language model. In this work we present REAPER (REAsoning-based PlannER), an LLM-based planner to generate retrieval plans in conversational systems. We show significant gains in latency over Agent-based systems and are able to scale easily to new and unseen use cases as compared to classification-based planning. Though our method can be applied to any RAG system, we show our results in the context of a conversational shopping assistant.

Submitted 30 July, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.02389 [pdf, other]

SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Authors: Sayan Nag, Koustava Goswami, Srikrishna Karanam

Abstract: Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves mIoUs of 59.31 and 48.26, compared to 58.93 and 48.19 obtained by the fully-supervised SOTA method SeqTR, on the RefCOCO+@testA and RefCOCO+@testB datasets, respectively. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+@testA) and 19.6% (on RefCOCO+@testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks.
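
For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how a mean intersection-over-union score can be computed over binary masks. It assumes the common convention of averaging per-sample IoU; the function names and toy masks are hypothetical and are not taken from the SafaRi paper or its evaluation code.

```python
def iou(pred, target):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average per-sample IoU over (prediction, ground-truth) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy 2x2 masks, purely illustrative.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # identical masks
]
score = mean_iou(pairs)
```

Benchmark numbers such as those in the abstract are usually reported as this score multiplied by 100.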

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted at ECCV 2024

arXiv:2407.01851 [pdf, other]

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

Abstract: Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted at ECCV 2024

arXiv:2406.04673 [pdf, other]

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: https://schowdhury671.github.io/melfusion_cvpr2024/

arXiv:2405.00716 [pdf, other]

Large Language Models in the Clinic: A Comprehensive Benchmark

Authors: Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

Abstract: The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the closed-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

Submitted 16 October, 2024; v1 submitted 25 April, 2024; originally announced May 2024.

Comments: Accepted at EMNLP 2024 Main Conference

arXiv:2403.19113 [pdf, other]

FACTOID: FACtual enTailment fOr hallucInation Detection

Authors: Vipula Rawte, S. M Towhidul Islam Tonmoy, Krishnav Rajbangshi, Shravani Nag, Aman Chadha, Amit P. Sheth, Amitava Das

Abstract: The widespread adoption of Large Language Models (LLMs) has facilitated numerous benefits. However, hallucination is a significant concern. In response, Retrieval Augmented Generation (RAG) has emerged as a highly promising paradigm to improve LLM outputs by grounding them in factual information. RAG relies on textual entailment (TE) or similar methods to check if the text produced by LLMs is supported or contradicted, compared to retrieved documents. This paper argues that conventional TE methods are inadequate for spotting hallucinations in content generated by LLMs. For instance, consider a prompt about the "USA's stance on the Ukraine war". The AI-generated text states, "...U.S. President Barack Obama says the U.S. will not put troops in Ukraine..." However, the U.S. president during the war is Joe Biden, so the statement contradicts factual reality. Moreover, current TE systems are unable to accurately annotate the given text and identify the exact portion that is contradicted. To address this, we introduce a new type of TE called "Factual Entailment (FE)", which aims to detect factual inaccuracies in content generated by LLMs while also highlighting the specific text segment that contradicts reality. We present FACTOID (FACTual enTAILment for hallucInation Detection), a benchmark dataset for FE. We propose a multi-task learning (MTL) framework for FE, incorporating state-of-the-art (SoTA) long text embeddings such as e5-mistral-7b-instruct, along with GPT-3, SpanBERT, and RoFormer. The proposed MTL architecture for FE achieves an average 40% improvement in accuracy on the FACTOID benchmark compared to SoTA TE methods. As FE automatically detects hallucinations, we assessed 15 modern LLMs and ranked them using our proposed Auto Hallucination Vulnerability Index (HVI_auto). This index quantifies and offers a comparative scale to evaluate and rank LLMs according to their hallucinations.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.18341 [pdf, other]

IterAlign: Iterative Constitutional Alignment of Large Language Models

Authors: Xiusi Chen, Hongzhi Wen, Sreyashi Nag, Chen Luo, Qingyu Yin, Ruirui Li, Zheng Li, Wei Wang

Abstract: With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to 13.5% in harmlessness.

Submitted 27 March, 2024; originally announced March 2024.

Comments: NAACL 2024

arXiv:2403.06021 [pdf, other]

Hierarchical Query Classification in E-commerce Search

Authors: Bing He, Sreyashi Nag, Limeng Cui, Suhang Wang, Zheng Li, Rahul Goutam, Zhen Li, Haiyang Zhang

Abstract: E-commerce platforms typically store and structure product information and search data in a hierarchy. Efficiently categorizing user search queries into a similar hierarchical structure is paramount in enhancing user experience on e-commerce platforms as well as news curation and academic research. The significance of this task is amplified when dealing with sensitive query categorization or critical information dissemination, where inaccuracies can lead to considerable negative impacts. The inherent complexity of hierarchical query classification is compounded by two primary challenges: (1) the pronounced class imbalance that skews towards dominant categories, and (2) the inherent brevity and ambiguity of search queries that hinder accurate classification. To address these challenges, we introduce a novel framework that leverages hierarchical information through (i) enhanced representation learning that utilizes the contrastive loss to discern fine-grained instance relationships within the hierarchy, called "instance hierarchy", and (ii) a nuanced hierarchical classification loss that attends to the intrinsic label taxonomy, named "label hierarchy". Additionally, based on our observation that certain unlabeled queries share typographical similarities with labeled queries, we propose a neighborhood-aware sampling technique to intelligently select these unlabeled queries to boost the classification performance. Extensive experiments demonstrate that our proposed method is better than state-of-the-art (SOTA) on the proprietary Amazon dataset, and comparable to SOTA on the public datasets of Web of Science and RCV1-V2. These results underscore the efficacy of our proposed solution, and pave the path toward the next generation of hierarchy-aware query classification systems.

Submitted 9 March, 2024; originally announced March 2024.

Comments: Published at: the ACM Web Conference 2024 in the industry track (WWW'24)

arXiv:2403.05435 [pdf, other]

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Authors: Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

Abstract: Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions. The project webpage is available at https://mondalanindya.github.io/OmniCount.

Submitted 22 February, 2025; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted to AAAI 2025

arXiv:2403.05174 [pdf, other]

VTruST: Controllable value function based subset selection for Data-Centric Trustworthy AI

Authors: Soumi Das, Shubhadip Nag, Shreyyash Sharma, Suparna Bhattacharya, Sourangshu Bhattacharya

Abstract: Trustworthy AI is crucial to the widespread adoption of AI in high-stakes applications, with fairness, robustness, and accuracy being some of the key trustworthiness metrics. In this work, we propose a controllable framework for data-centric trustworthy AI (DCTAI), VTruST, that allows users to control the trade-offs between the different trustworthiness metrics of the constructed training datasets. A key challenge in implementing an efficient DCTAI framework is to design an online value-function-based training data subset selection algorithm. We pose the training data valuation and subset selection problem as an online sparse approximation formulation. We propose a novel online version of the Orthogonal Matching Pursuit (OMP) algorithm for solving this problem. Experimental results show that VTruST outperforms the state-of-the-art baselines on social, image, and scientific datasets. We also show that the data values generated by VTruST can provide effective data-centric explanations for different trustworthiness metrics.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted in ICLR 2024 DMLR workshop

arXiv:2401.16455 [pdf, other]

The Carbon Premium: Correlation or Causation? Evidence from S&P 500 Companies

Authors: Namasi G. Sankar, Suryadeepto Nag, Siddhartha P. Chakrabarty, Sankarshan Basu

Abstract: In the context of whether investors are aware of carbon-related risks, it is often hypothesized that there may be a carbon premium in the value of stocks of firms, conferring an abnormal excess value to firms' shares as a form of compensation to investors for their transition risk exposure through the ownership of carbon-intensive stocks. However, there is little consensus in the literature regarding the existence of such a premium. Moreover, few studies have examined whether the correlation that is often observed is actually causal. The pertinent question is whether more polluting firms give higher returns, or whether firms with high returns have less incentive to decarbonize. In this study, we investigate whether firms' emissions are causally linked to the presence of a carbon premium in a panel of 141 firms listed in the S&P 500 index using fixed-effects analysis, with propensity score weighting to control for selection bias in which firms increase their emissions. We find that there is a statistically significant positive carbon premium associated with Scope 1 emissions, while there is no significant premium associated with Scope 2 emissions, implying that risks associated with direct emissions by the firm are priced, while bought emissions are not.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2312.12423 [pdf, other]

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

Abstract: The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.

Submitted 19 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: CVPR 2024 Highlight

arXiv:2312.05407 [pdf, other]

ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation

Authors: Md Shazid Islam, Sayak Nag, Arindam Dutta, Miraj Ahmed, Fahim Faisal Niloy, Amit K. Roy-Chowdhury

Abstract: Unsupervised domain adaptive segmentation typically relies on self-training using pseudo labels predicted by a pre-trained network on an unlabeled target dataset. However, the noisy nature of such pseudo-labels presents a major bottleneck in adapting a network to the distribution shift between source and target datasets. This challenge is exacerbated when the network encounters an incoming data stream in online fashion, where the network is constrained to adapt to incoming streams of target domain data in exactly one round of forward and backward passes. In this scenario, relying solely on inaccurate pseudo-labels can lead to low-quality segmentation, which is detrimental to medical image analysis where accuracy and precision are of utmost priority. We hypothesize that a small amount of pixel-level annotation obtained from an expert can address this problem, thereby enhancing the performance of domain adaptation of online streaming data, even in the absence of dedicated training data. We call our method ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation. It adapts to each incoming data batch in an online setup, incorporating feedback from an expert through active learning. Through active learning, the most informative pixels in each image can be selected for expert annotation. However, the acquisition of pixel-level annotations across all images in a batch often leads to redundant information while increasing temporal overhead in online learning. To reduce the annotation acquisition time and make the adaptation process more online-friendly, we further propose a novel image-pruning strategy that selects the most useful subset of images from the current batch for active learning. Our proposed approach outperforms existing online adaptation approaches and produces competitive results compared to offline domain adaptive active learning methods.

Submitted 15 October, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

arXiv:2312.01564 [pdf, other]

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

Authors: Sanjoy Chowdhury, Sayan Nag, Dinesh Manocha

Abstract: The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities. We enforce consistency between the respective encoder branches (receiving augmented inputs) to prevent overfitting in downstream tasks. Our method is evaluated on three representative tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. In practice, APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.

Submitted 3 December, 2023; originally announced December 2023.

Comments: Accepted at EMNLP 2023 (Main track)

arXiv:2311.05198 [pdf, other]

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

Authors: Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu

Abstract: Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availability of reliable segmentation annotations, limiting their overall performance. To address this inherent limitation, we introduce an innovative model-agnostic Cloud Adaptive-Labeling (CAL) approach, which operates iteratively to enhance the quality of training data annotations and consequently improve the performance of the learned model. Our methodology commences by training a cloud segmentation model using the original annotations. Subsequently, it introduces a trainable pixel intensity threshold for adaptively labeling the cloud training images on the fly. The newly generated labels are then employed to fine-tune the model. Extensive experiments conducted on multiple standard cloud segmentation benchmarks demonstrate the effectiveness of our approach in significantly boosting the performance of existing segmentation models. Our CAL method establishes new state-of-the-art results when compared to a wide array of existing alternatives.

Submitted 9 November, 2023; originally announced November 2023.

Comments: Accepted at the TCCML Workshop at NeurIPS 2023

arXiv:2309.09178 [pdf, other]

Does Reliable Electricity Mean Lesser Agricultural Labor Wages? Evidence from Indian Villages

Authors: Suryadeepto Nag

Abstract: Using a panel of 1,171 villages in rural India that were surveyed in the India Human Development Surveys, I perform a difference-in-differences analysis to find that improvements in electricity reliability have a negative effect on the increase in casual agricultural labor wage rates. Changes in men's wage rates are found to be affected more adversely than women's, resulting in a smaller widening of the gender wage gap. I find that better electricity reliability substantially reduces the time spent by women in fuel collection, which could potentially increase labor supply. The demand for labor remains unaffected by reliability, so the resulting surplus in labor supply could stunt wage rates. However, I show that electrical appliances such as groundwater pumps considerably increase labor demand, indicating that governments could target increasing the adoption of electric pumps along with bettering the quality of electricity to absorb the surplus labor into agriculture.
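
As a rough illustration of the difference-in-differences idea named above, the sketch below computes the simplest two-group, two-period estimate: the change in the treated group minus the change in the control group. The function name and the wage figures are hypothetical and purely illustrative; the paper's actual panel specification and controls are not reproduced here.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up wage rates for villages with (treated) and without (control)
# an improvement in electricity reliability.
treated_pre, treated_post = [100.0, 110.0], [130.0, 140.0]
control_pre, control_post = [100.0, 120.0], [150.0, 170.0]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
```

In this toy data, wages rise in both groups but rise less in the treated group, so the estimate is negative, which is the sign pattern the abstract describes for the increase in wage rates.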

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2308.14115 [pdf, other]

Situated Natural Language Explanations

Authors: Zining Zhu, Haoming Jiang, Jingfeng Yang, Sreyashi Nag, Chao Zhang, Jie Huang, Yifan Gao, Frank Rudzicz, Bing Yin

Abstract: Natural language is among the most accessible tools for explaining decisions to humans, and large pretrained language models (PLMs) have demonstrated impressive abilities to generate coherent natural language explanations (NLE). The existing NLE research perspectives do not take the audience into account. An NLE can have high textual quality, but it might not accommodate audiences' needs and preferences. To address this limitation, we propose an alternative perspective, situated NLE. On the evaluation side, we set up automated evaluation scores. These scores describe the properties of NLEs in lexical, semantic, and pragmatic categories. On the generation side, we identify three prompt engineering techniques and assess their applicability to the situations. Situated NLE provides a perspective and facilitates further research on the generation and evaluation of explanations.

Submitted 24 March, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

arXiv:2308.07293 [pdf, other]

DiffSED: Sound Event Detection with Denoising Diffusion

Authors: Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

Abstract: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training.

Submitted 16 August, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

arXiv:2307.10763 [pdf, other]

doi 10.1109/ICCVW60793.2023.00086

Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms prior actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

Submitted 10 January, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Published at the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France

Showing 1–50 of 159 results for author: Nag, S