
Showing 1–50 of 159 results for author: Nag, S

  1. arXiv:2506.21080  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

    Authors: Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao

    Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted at ICCV 2025

  2. arXiv:2506.11946  [pdf, ps, other]

    math.NA

    A visco-plastic constitutive model for accurate densification and shape predictions in powder metallurgy hot isostatic pressing

    Authors: Subrato Sarkar, Jason R Mayeur, KPK Ajjarapu, Fred A List III, Soumya Nag, Ryan R Dehoff

    Abstract: Powder metallurgy hot isostatic pressing (PM-HIP) is an advanced manufacturing process that produces near net shape parts with high material utilization and uniform microstructures. Despite being used frequently to produce small-scale components, the application of PM-HIP to large-scale components is limited due to inadequate understanding of its complex mechanisms that cause unpredictable post-HI…

    Submitted 13 June, 2025; originally announced June 2025.

  3. arXiv:2506.07016  [pdf, ps, other]

    cs.CV cs.AI

    MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

    Authors: Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha

    Abstract: Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retr…

    Submitted 13 June, 2025; v1 submitted 8 June, 2025; originally announced June 2025.

    Comments: Audio-visual learning, Audio-Visual RAG, Multi-Video Linkage

  4. arXiv:2505.23907  [pdf, ps, other]

    cs.CV

    Cora: Correspondence-aware image editing using few step diffusion

    Authors: Amirhossein Almohammadi, Aryan Mikaeili, Sauradip Nag, Negar Hassanpour, Andrea Tagliasacchi, Ali Mahdavi-Amiri

    Abstract: Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few step editing approaches produce artifacts such as irrelevant texture or struggle t…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Published in SIGGRAPH 2025

    ACM Class: I.4.10; I.3.7; I.2.10

  5. arXiv:2505.20737  [pdf, ps, other]

    cs.AI

    RRO: LLM Agent Optimization Through Rising Reward Trajectories

    Authors: Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Muhammad Sarwar

    Abstract: Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks while it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: preprint

  6. arXiv:2505.18832  [pdf, other]

    cs.CV

    Localizing Knowledge in Diffusion Transformers

    Authors: Arman Zarei, Samyadeep Basu, Keivan Rezaei, Zihao Lin, Sayan Nag, Soheil Feizi

    Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize wher…

    Submitted 24 May, 2025; originally announced May 2025.

  7. arXiv:2505.15196  [pdf, ps, other]

    cs.CL

    EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association

    Authors: Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, Yangqiu Song

    Abstract: Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains undere…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: ACL2025

  8. arXiv:2505.10237  [pdf, ps, other]

    nucl-ex nucl-th

    Competition between the neutron-proton pair break-ups delineating the level structure of 202Po

    Authors: Sahab Singh, D. Choudhury, B. Maheshwari, R. Roy, K. Yadav, R. Palit, B. Das, P. Dey, A. Kundu, Md. S. R. Laskar, D. Negi, V. Malik, S. Jadhav, B. S. Naidu, A. V. Thomas, D. L. Balabanski, A. Dhal, S. Bhattacharya, A. K. Singh, S. Bhattacharyya, S. Nag

    Abstract: High-spin spectroscopic study of $^{202}$Po ($Z$ = 84, $N$ = 118) has been carried out using the $^{195}$Pt($^{12}$C, 5n)$^{202}$Po fusion-evaporation reaction. An extended level scheme has been proposed up to an excitation energy of $E_x\approx$ 8 MeV and angular momentum of 27$\hbar$, with the addition of 57 newly observed $γ$-ray transitions, along with the revisions in the placement of 8 alrea…

    Submitted 15 May, 2025; originally announced May 2025.

  9. arXiv:2504.10724  [pdf, other]

    cs.CL cs.LG

    HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

    Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das

    Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when it confidently finds an out…

    Submitted 14 April, 2025; originally announced April 2025.

  10. arXiv:2504.09723  [pdf, other]

    cs.HC cs.CL

    AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

    Authors: Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang

    Abstract: A/B testing is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on large-scale, live traffic from human participants and by the long wait for test results. Through formative interviews with six experienced industry practitioners, we identified critical bottlene…

    Submitted 21 April, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

  11. arXiv:2504.08366  [pdf, ps, other]

    cs.GR cs.CV

    In-2-4D: Inbetweening from Two Single-View Images to 4D Generation

    Authors: Sauradip Nag, Daniel Cohen-Or, Hao Zhang, Ali Mahdavi-Amiri

    Abstract: We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D. We utilize a video interpolation model to predict the motion, but la…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: Technical Report

  12. arXiv:2503.23219  [pdf, other]

    eess.AS cs.AI cs.CV cs.LG

    Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

    Authors: Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha

    Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into…

    Submitted 29 March, 2025; originally announced March 2025.

  13. arXiv:2503.15742  [pdf, other]

    cs.CV

    Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

    Authors: Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury

    Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we ad…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 13 pages, 7 figures

  14. arXiv:2503.13947  [pdf, other]

    cs.CV

    Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

    Authors: Sayak Nag, Udita Ghosh, Calvin-Khang Ta, Sarosij Bose, Jiachen Li, Amit K Roy Chowdhury

    Abstract: Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (C…

    Submitted 10 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: Accepted at CVPR 2025

  15. arXiv:2502.12173  [pdf, other]

    cs.LG cs.AI

    nanoML for Human Activity Recognition

    Authors: Alan T. L. Bacellar, Mugdha P. Jadhao, Shashank Nag, Priscila M. V. Lima, Felipe M. G. Franca, Lizy K. John

    Abstract: Human Activity Recognition (HAR) is critical for applications in healthcare, fitness, and IoT, but deploying accurate models on resource-constrained devices remains challenging due to high energy and memory demands. This paper demonstrates the application of Differentiable Weightless Neural Networks (DWNs) to HAR, achieving competitive accuracies of 96.34% and 96.67% while consuming only 56 nJ and…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

  16. arXiv:2502.07278  [pdf, other]

    cs.CV

    Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization

    Authors: Aditya Vora, Sauradip Nag, Hao Zhang

    Abstract: We present ATOP (Articulate That Object Part), a novel few-shot method based on motion personalization to articulate a static 3D object with respect to a part and its motion as prescribed in a text prompt. Given the scarcity of available datasets with motion attribute annotations, existing methods struggle to generalize well in this task. In our work, the text input allows us to tap into the power…

    Submitted 13 March, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: Technical Report, 16 pages

  17. arXiv:2502.01555  [pdf, other]

    cs.IR cs.AI cs.LG

    Query Brand Entity Linking in E-Commerce Search

    Authors: Dong Liu, Sreyashi Nag

    Abstract: In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averagin…

    Submitted 3 February, 2025; originally announced February 2025.

  18. arXiv:2501.07845  [pdf, other]

    cs.CL

    Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning

    Authors: Haoyu Han, Yaochen Xie, Hui Liu, Xianfeng Tang, Sreyashi Nag, William Headden, Hui Liu, Yang Li, Chen Luo, Shuiwang Ji, Qi He, Jiliang Tang

    Abstract: Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks; however, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between distinct pieces of information within text sequences. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop ques…

    Submitted 14 January, 2025; originally announced January 2025.

  19. arXiv:2501.02135  [pdf, other]

    cs.CV cs.AI

    AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

    Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

    Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate…

    Submitted 3 January, 2025; originally announced January 2025.

  20. arXiv:2501.01039  [pdf, other]

    cs.CL cs.AI

    MSWA: Refining Local Attention with Multi-Scale Window Attention

    Authors: Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum

    Abstract: Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each hea…

    Submitted 1 January, 2025; originally announced January 2025.

  21. arXiv:2412.21198  [pdf, other]

    cond-mat.stat-mech

    Dynamic magnetic response in ABA type trilayered systems and compensation phenomenon

    Authors: Enakshi Guru, Sonali Saha, Sankhasubhra Nag

    Abstract: Dynamic magnetic response in a trilayered structure with non-equivalent layers (ABA type) has been studied with Monte Carlo simulation using the Metropolis algorithm. In each layer, ferromagnetic (FM) nearest neighbour Ising interactions are present along with antiferromagnetic (AFM) nearest neighbour coupling across different layers. The system is studied under a harmonically oscillating external mag…

    Submitted 30 December, 2024; originally announced December 2024.

    Comments: 18 pages, 13 figures

    MSC Class: 82C26

  22. arXiv:2411.08028  [pdf, other]

    cs.AI

    Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

    Authors: Juanhui Li, Sreyashi Nag, Hui Liu, Xianfeng Tang, Sheikh Sarwar, Limeng Cui, Hansu Gu, Suhang Wang, Qi He, Jiliang Tang

    Abstract: In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their traini…

    Submitted 30 March, 2025; v1 submitted 12 November, 2024; originally announced November 2024.

  23. arXiv:2411.01818  [pdf, other]

    cs.LG

    Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference

    Authors: Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John

    Abstract: Transformers are set to become ubiquitous with applications ranging from chatbots and educational assistants to visual recognition and remote sensing. However, their increasing computational and memory demands are resulting in growing energy consumption. Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications. Look Up Table (LUT)…

    Submitted 4 November, 2024; originally announced November 2024.

  24. arXiv:2410.18538  [pdf, other]

    cs.CV

    SMITE: Segment Me In TimE

    Authors: Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

    Abstract: Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by emplo…

    Submitted 18 February, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: ICLR 2025; Project page is at https://segment-me-in-time.github.io/

  25. arXiv:2410.17952  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

    Authors: Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

    Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approa…

    Submitted 24 January, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: Accepted to NAACL 2025 main conference

    Journal ref: NAACL 2025

  26. arXiv:2410.14220  [pdf, other]

    physics.ins-det

    Comparative Performance Analysis of Crystals in Total-Body PET Scanners: Monte-Carlo Simulation Study with Different Materials and Geometry

    Authors: D. Choudhary, S. Nag

    Abstract: Total-Body PET (TB-PET) scanners represent a significant advancement in medical diagnostics, exemplified by the uEXPLORER, the world's first TB-PET system with an axial span of 194 cm, which exhibits exceptional sensitivity and spatial resolution. This study employs the Monte Carlo simulation toolkit Geant4 to evaluate various configurations and materials of detector crystals. We concentrate on th…

    Submitted 9 April, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

    Comments: 12 pages, 13 total figures, 7 captioned figures

  27. arXiv:2409.04748  [pdf, other]

    cond-mat.soft cond-mat.stat-mech physics.comp-ph

    Dissipative self-assembly of patchy particles under nonequilibrium drive: a computational study

    Authors: Shubhadeep Nag, Gili Bisker

    Abstract: Inspired by biology and implemented using nanotechnology, the self-assembly of patchy particles has emerged as a pivotal mechanism for constructing complex structures that mimic natural systems with diverse functionalities. Here, we explore the dissipative self-assembly of patchy particles under nonequilibrium conditions, with the aim of overcoming the constraints imposed by equilibrium assembly.…

    Submitted 7 September, 2024; originally announced September 2024.

    Comments: 74 pages and 21 figures

  28. arXiv:2409.00975  [pdf, other]

    astro-ph.GA cond-mat.soft physics.space-ph

    Deciphering Interstellar Ice Morphology: Atomistic Simulations Reveal the Complex Behavior of Ethanethiol

    Authors: Jeet Majumdar, Shubhadeep Nag, Tejender S Thakur, Subramanian Yashonath, Bhalamurugan Sivaraman, Prabal K. Maiti

    Abstract: Ethanethiol (C$_2$H$_5$SH), a molecule detected in the interstellar medium (ISM), indicates the rich chemistry involving sulfur atoms. However, its behavior at low temperatures remains elusive, particularly the reported transition from an amorphous phase to a crystal. This study employs classical molecular dynamics (MD) simulations to reproduce the liquid-state properties of ethanethiol and to sim…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Manuscript accepted for publication in "Astrophysics and Space Science Proceedings", Springer nature (Symposium: ISRA 2023)

  29. arXiv:2408.02215  [pdf]

    cs.IR

    Exploring Query Understanding for Amazon Product Search

    Authors: Chen Luo, Xianfeng Tang, Hanqing Lu, Yaochen Xie, Hui Liu, Zhenwei Dai, Limeng Cui, Ashutosh Joshi, Sreyashi Nag, Yang Li, Zhen Li, Rahul Goutam, Jiliang Tang, Haiyang Zhang, Qi He

    Abstract: Online shopping platforms, such as Amazon, offer services to billions of people worldwide. Unlike web search or other search engines, product search engines have their unique characteristics, primarily featuring short queries which are mostly a combination of product attributes and structured product search space. The uniqueness of product search underscores the crucial importance of the query und…

    Submitted 4 August, 2024; originally announced August 2024.

  30. arXiv:2408.01690  [pdf, other]

    cs.CV cs.AI cs.MM

    IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

    Authors: Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou

    Abstract: Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark…

    Submitted 3 September, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 40 pages

  31. arXiv:2407.19392  [pdf, other]

    cs.CR

    AndroCon: Conning Location Services in Android

    Authors: Soham Nag, Smruti R. Sarangi

    Abstract: Mobile device hackers often target ambient sensing, human activity identification, and interior floor mapping. In addition to overt signals like microphones and cameras, covert channels like WiFi, Bluetooth, and augmented GPS signal strengths have been employed to gather this information. To date, passive, receive-only satellite GPS sensing relied solely on signal strength and location informat…

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: 18 pages

  32. arXiv:2407.18553  [pdf, other]

    cs.IR

    REAPER: Reasoning based Retrieval Planning for Complex RAG Systems

    Authors: Ashutosh Joshi, Sheikh Muhammad Sarwar, Samarth Varshney, Sreyashi Nag, Shrivats Agrawal, Juhi Naik

    Abstract: Complex dialog systems often use retrieved evidence to facilitate factual responses. Such RAG (Retrieval Augmented Generation) systems retrieve from massive heterogeneous data stores that are usually architected as multiple indexes or APIs instead of a single monolithic source. For a given query, relevant evidence needs to be retrieved from one or a small subset of possible retrieval sources. Comp…

    Submitted 30 July, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

  33. arXiv:2407.02389  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

    Authors: Sayan Nag, Koustava Goswami, Srikrishna Karanam

    Abstract: Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RE…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  34. arXiv:2407.01851  [pdf, other]

    cs.CV cs.AI cs.LG eess.AS

    Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

    Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

    Abstract: Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained un…

    Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  35. arXiv:2406.04673  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

    Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

    Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, w…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: https://schowdhury671.github.io/melfusion_cvpr2024/

  36. arXiv:2405.00716  [pdf, other]

    cs.CL cs.AI

    Large Language Models in the Clinic: A Comprehensive Benchmark

    Authors: Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

    Abstract: The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first coll…

    Submitted 16 October, 2024; v1 submitted 25 April, 2024; originally announced May 2024.

    Comments: Accepted at EMNLP 2024 Main Conference

  37. arXiv:2403.19113  [pdf, other]

    cs.CL cs.AI

    FACTOID: FACtual enTailment fOr hallucInation Detection

    Authors: Vipula Rawte, S. M Towhidul Islam Tonmoy, Krishnav Rajbangshi, Shravani Nag, Aman Chadha, Amit P. Sheth, Amitava Das

    Abstract: The widespread adoption of Large Language Models (LLMs) has facilitated numerous benefits. However, hallucination is a significant concern. In response, Retrieval Augmented Generation (RAG) has emerged as a highly promising paradigm to improve LLM outputs by grounding them in factual information. RAG relies on textual entailment (TE) or similar methods to check if the text produced by LLMs is supp…

    Submitted 27 March, 2024; originally announced March 2024.

  38. arXiv:2403.18341  [pdf, other]

    cs.CL

    IterAlign: Iterative Constitutional Alignment of Large Language Models

    Authors: Xiusi Chen, Hongzhi Wen, Sreyashi Nag, Chen Luo, Qingyu Yin, Ruirui Li, Zheng Li, Wei Wang

    Abstract: With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are l…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: NAACL 2024

  39. arXiv:2403.06021  [pdf, other]

    cs.IR cs.LG

    Hierarchical Query Classification in E-commerce Search

    Authors: Bing He, Sreyashi Nag, Limeng Cui, Suhang Wang, Zheng Li, Rahul Goutam, Zhen Li, Haiyang Zhang

    Abstract: E-commerce platforms typically store and structure product information and search data in a hierarchy. Efficiently categorizing user search queries into a similar hierarchical structure is paramount in enhancing user experience on e-commerce platforms as well as news curation and academic research. The significance of this task is amplified when dealing with sensitive query categorization or criti…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: Published at: the ACM Web Conference 2024 in the industry track (WWW'24)

  40. arXiv:2403.05435  [pdf, other]

    cs.CV eess.IV eess.SP

    OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

    Authors: Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

    Abstract: Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficien…

    Submitted 22 February, 2025; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to AAAI 2025

  41. arXiv:2403.05174  [pdf, other]

    cs.LG

    VTruST: Controllable value function based subset selection for Data-Centric Trustworthy AI

    Authors: Soumi Das, Shubhadip Nag, Shreyyash Sharma, Suparna Bhattacharya, Sourangshu Bhattacharya

    Abstract: Trustworthy AI is crucial to the widespread adoption of AI in high-stakes applications, with fairness, robustness, and accuracy being some of the key trustworthiness metrics. In this work, we propose a controllable framework for data-centric trustworthy AI (DCTAI), VTruST, that allows users to control the trade-offs between the different trustworthiness metrics of the constructed training datasets.…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted in ICLR 2024 DMLR workshop

  42. arXiv:2401.16455  [pdf, other]

    q-fin.PR

    The Carbon Premium: Correlation or Causation? Evidence from S&P 500 Companies

    Authors: Namasi G. Sankar, Suryadeepto Nag, Siddhartha P. Chakrabarty, Sankarshan Basu

    Abstract: In the context of whether investors are aware of carbon-related risks, it is often hypothesized that there may be a carbon premium in the value of stocks of firms, conferring an abnormal excess value to firms' shares as a form of compensation to investors for their transition risk exposure through the ownership of carbon intensive stocks. However, there is little consensus in the literature regar…

    Submitted 29 January, 2024; originally announced January 2024.

  43. arXiv:2312.12423  [pdf, other]

    cs.CV cs.AI

    Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

    Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

    Abstract: The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a…

    Submitted 19 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Highlight

  44. arXiv:2312.05407  [pdf, other]

    cs.CV

    ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation

    Authors: Md Shazid Islam, Sayak Nag, Arindam Dutta, Miraj Ahmed, Fahim Faisal Niloy, Amit K. Roy-Chowdhury

    Abstract: Unsupervised domain adaptive segmentation typically relies on self-training using pseudo labels predicted by a pre-trained network on an unlabeled target dataset. However, the noisy nature of such pseudo-labels presents a major bottleneck in adapting a network to the distribution shift between source and target datasets. This challenge is exaggerated when the network encounters an incoming data st…

    Submitted 15 October, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  45. arXiv:2312.01564  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

    Authors: Sanjoy Chowdhury, Sayan Nag, Dinesh Manocha

    Abstract: The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We intro…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: Accepted at EMNLP 2023 (Main track)

  46. arXiv:2311.05198  [pdf, other]

    cs.CV

    Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

    Authors: Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu

    Abstract: Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availabil…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted at the TCCML Workshop at NeurIPS 2023

  47. arXiv:2309.09178  [pdf, other]

    econ.GN

    Does Reliable Electricity Mean Lesser Agricultural Labor Wages? Evidence from Indian Villages

    Authors: Suryadeepto Nag

    Abstract: Using a panel of 1,171 villages in rural India that were surveyed in the India Human Development Surveys, I perform a difference-in-differences analysis to find that improvements in electricity reliability have a negative effect on the increase in casual agricultural labor wage rates. Changes in men's wage rates are found to be affected more adversely than women's, resulting in a smaller widening…

    Submitted 17 September, 2023; originally announced September 2023.

  48. arXiv:2308.14115  [pdf, other]

    cs.CL

    Situated Natural Language Explanations

    Authors: Zining Zhu, Haoming Jiang, Jingfeng Yang, Sreyashi Nag, Chao Zhang, Jie Huang, Yifan Gao, Frank Rudzicz, Bing Yin

    Abstract: Natural language is among the most accessible tools for explaining decisions to humans, and large pretrained language models (PLMs) have demonstrated impressive abilities to generate coherent natural language explanations (NLE). The existing NLE research perspectives do not take the audience into account. An NLE can have high textual quality, but it might not accommodate audiences' needs and prefe…

    Submitted 24 March, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

  49. arXiv:2308.07293  [pdf, other]

    cs.SD cs.LG eess.AS

    DiffSED: Sound Event Detection with Denoising Diffusion

    Authors: Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

    Abstract: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate t…

    Submitted 16 August, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

  50. arXiv:2307.10763  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Actor-agnostic Multi-label Action Recognition with Multi-modal Query

    Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

    Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecti…

    Submitted 10 January, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Published at the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France