
Showing 1–50 of 175 results for author: Baral, C

  1. arXiv:2510.18083  [pdf, ps, other]

    cs.CV

    Chimera: Compositional Image Generation using Part-based Concepting

    Authors: Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, Chitta Baral

    Abstract: Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from diffe…

    Submitted 22 October, 2025; v1 submitted 20 October, 2025; originally announced October 2025.

  2. arXiv:2510.11812  [pdf, ps, other]

    cs.CL cs.AI

    PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

    Authors: Souradeep Mukhopadhyay, Rishabh Baral, Nimeesh Mahajan, Samhitha Harish, Aswin RRV, Mihir Parmar, Mutsumi Nakamura, Chitta Baral

    Abstract: Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles--but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In par…

    Submitted 13 October, 2025; originally announced October 2025.

    Comments: 22 Pages

  3. arXiv:2510.03955  [pdf, ps, other]

    cs.CV

    Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs

    Authors: Sameep Vani, Shreyas Jena, Maitreya Patel, Chitta Baral, Somak Aditya, Yezhou Yang

    Abstract: While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks, particularly in video captioning and descriptive tasks, they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading…

    Submitted 4 October, 2025; originally announced October 2025.

    Comments: 17 pages, 9 figures, 6 tables. Presents TimeWarp, a synthetic preference data framework to improve temporal understanding in Video-LLMs, showing consistent gains across seven benchmarks. Includes supplementary material in the Appendix

  4. arXiv:2510.03777  [pdf, ps, other]

    cs.AI cs.LG

    GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

    Authors: Divij Handa, Mihir Parmar, Aswin RRV, Md Nayem Uddin, Hamid Palangi, Chitta Baral

    Abstract: Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference time, it often struggles to generate diverse solution candidates, frequently relying on the same underlying approach to solve the problem and thus producing redundant samples. To address this limitation, we propose a new…

    Submitted 4 October, 2025; originally announced October 2025.

  5. arXiv:2509.26555  [pdf, ps, other]

    cs.CV

    Stable Cinemetrics : Structured Taxonomy and Evaluation for Professional Video Generation

    Authors: Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani

    Abstract: Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies:…

    Submitted 30 September, 2025; originally announced September 2025.

    Comments: NeurIPS 2025. Project Page : https://stable-cinemetrics.github.io/

  6. arXiv:2509.25248  [pdf, ps, other]

    cs.SE cs.AI cs.PL

    BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

    Authors: Zehua Zhang, Ati Priya Bajaj, Divij Handa, Siyu Liu, Arvind S Raj, Hongkai Chen, Hulin Wang, Yibo Liu, Zion Leonahenahe Basque, Souradip Nath, Vishal Juneja, Nikhil Chapre, Yan Shoshitaishvili, Adam Doupé, Chitta Baral, Ruoyu Wang

    Abstract: Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of hig…

    Submitted 26 September, 2025; originally announced September 2025.

  7. arXiv:2509.16141  [pdf, ps, other]

    cs.CV

    AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

    Authors: Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral

    Abstract: Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depict…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: Project Page : https://vatsal-malaviya.github.io/AcT2I/

  8. arXiv:2508.20931  [pdf, ps, other]

    cs.CL

    How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

    Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral

    Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $τ$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of…

    Submitted 1 September, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

    Comments: Accepted to EMNLP 2025 Findings

  9. arXiv:2508.07616  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    ThinkTuning: Instilling Cognitive Reflections without Distillation

    Authors: Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou

    Abstract: Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How c…

    Submitted 21 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

    Comments: EMNLP 2025 (Main Conference)

  10. arXiv:2507.07495  [pdf, ps, other]

    cs.CL cs.AI

    PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

    Authors: Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister

    Abstract: Recently, decomposing complex problems into simple subtasks--a crucial part of human-like natural planning--to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a u…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: 15 Pages

  11. arXiv:2507.03123  [pdf, ps, other]

    cs.CV cs.CL cs.LG

    Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations

    Authors: Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Chitta Baral, Yezhou Yang

    Abstract: Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, an…

    Submitted 11 October, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

  12. arXiv:2506.08123  [pdf, ps, other]

    cs.CL

    QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

    Authors: Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

    Abstract: Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise p…

    Submitted 26 September, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to Findings of EMNLP 2025

  13. arXiv:2506.03448  [pdf, ps, other]

    cs.CV

    RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

    Authors: Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral

    Abstract: Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perfor…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Project page: http://refedit.vercel.app

  14. arXiv:2505.23174  [pdf, ps, other]

    cs.CL

    Map&Make: Schema Guided Text to Table Generation

    Authors: Naman Ahuja, Fenil Bardoliya, Chitta Baral, Vivek Gupta

    Abstract: Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which "d…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025

  15. arXiv:2505.21863  [pdf, ps, other]

    cs.CV cs.CL

    GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

    Authors: Shikhhar Siingh, Abhinav Rawat, Chitta Baral, Vivek Gupta

    Abstract: Publicly significant images from events hold valuable contextual information, crucial for journalism and education. However, existing methods often struggle to extract this relevance accurately. To address this, we introduce GETReason (Geospatial Event Temporal Reasoning), a framework that moves beyond surface-level image descriptions to infer deeper contextual meaning. We propose that extracting…

    Submitted 2 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  16. arXiv:2505.21771  [pdf, other]

    cs.CV cs.AI

    MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

    Authors: Prasham Yatinkumar Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

    Abstract: Multimodal tables, those that integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While Large Language Models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table r…

    Submitted 27 May, 2025; originally announced May 2025.

  17. arXiv:2505.20816  [pdf, ps, other]

    cs.CL

    Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

    Authors: Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta

    Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address the…

    Submitted 27 May, 2025; originally announced May 2025.

  18. arXiv:2503.13730  [pdf, other]

    cs.CV cs.CL

    TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

    Authors: Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang

    Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to em…

    Submitted 17 March, 2025; originally announced March 2025.

  19. arXiv:2503.00043  [pdf, other]

    cs.CV cs.AI cs.CL

    VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

    Authors: Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang

    Abstract: Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs…

    Submitted 4 March, 2025; v1 submitted 25 February, 2025; originally announced March 2025.

    Comments: Accepted at ICLR 2025. Code and data: https://github.com/nlylmz/Voila

  20. arXiv:2502.16111  [pdf, other]

    cs.AI cs.CL

    PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

    Authors: Mihir Parmar, Xin Liu, Palash Goyal, Yanfei Chen, Long Le, Swaroop Mishra, Hossein Mobahi, Jindong Gu, Zifeng Wang, Hootan Nakhost, Chitta Baral, Chen-Yu Lee, Tomas Pfister, Hamid Palangi

    Abstract: Recent agent frameworks and inference-time algorithms often struggle with complex planning problems due to limitations in verifying generated plans or reasoning and varying complexity of instances within a single task. Many existing methods for these tasks either perform task-level verification without considering constraints or apply inference-time algorithms without adapting to instance-level co…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: 30 pages

  21. arXiv:2502.06023  [pdf, ps, other]

    cs.CV

    Dual Caption Preference Optimization for Diffusion Models

    Authors: Amir Saeidi, Yiran Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral

    Abstract: Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, within the existing preference datasets, the original caption often does not clearly fa…

    Submitted 18 October, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

  22. arXiv:2502.05675  [pdf, other]

    cs.CL

    Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning

    Authors: Venkatesh Mishra, Bimsara Pathiraja, Mihir Parmar, Sat Chidananda, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral

    Abstract: Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules and precedents while balancing deductive and analogical reasoning, and conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy.…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: Accepted to NAACL 2025 Findings

  23. arXiv:2501.13299  [pdf, other]

    cs.CL

    Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents

    Authors: Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral

    Abstract: Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials scienc…

    Submitted 8 February, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

    Comments: Accepted in NAACL 2025

  24. arXiv:2412.05658  [pdf]

    cond-mat.mtrl-sci

    Recent advances in hydrogen production using sulfide-based photocatalysts

    Authors: Suresh Chandra Baral, Dilip Sasmal, Mitali Hupele, Sradhanjali Lenka, Somaditya Sen

    Abstract: Sulfide-based photocatalysts (PC) are promising materials for efficiently producing hydrogen (H2). This chapter aims to provide a detailed survey of the recent advancements in sulfide-based photocatalysts and emphasize their enhanced performance and pathways to efficient H2 production. A detailed summary has been given, including several metal sulfides, such as cadmium sulfide (CdS), zinc sulfide…

    Submitted 7 December, 2024; originally announced December 2024.

  25. arXiv:2412.05637  [pdf]

    physics.app-ph cond-mat.mtrl-sci physics.chem-ph

    Enhancing Fenton-like Photo-degradation and Electrocatalytic Oxygen Evolution Reaction (OER) in Fe-doped Copper Oxide (CuO) Catalysts

    Authors: Suresh Chandra Baral, Dilip Sasmal, Sayak Datta, Mange Ram, Krishna Kanta Haldar, A. Mekki, Somaditya Sen

    Abstract: Although hydrogen generation by water electrolysis is the cheapest of all other available sources, water splitting still occurs with sluggish kinetics. It is a challenging barrier for H2 production on a large scale. Moreover, research is still underway to understand the oxygen evolution reaction (OER) and design the catalysts with improved OER performance. Herein, we report the synthesis, characte…

    Submitted 7 December, 2024; originally announced December 2024.

  26. arXiv:2411.08515  [pdf]

    cond-mat.mtrl-sci

    Effect of pH on photocatalytic degradation of Methylene Blue in water by facile hydrothermally grown TiO2 Nanoparticles under Natural Sunlight

    Authors: Uttama Kumar Saint, Suresh Chandra Baral, Dilip Sasmal, P. Maneesha, Sayak Datta, Farzana Naushin, Somaditya Sen

    Abstract: Each year, the production of synthetic dye wastewater reaches a trillion tons, posing a significant challenge to addressing water scarcity on a global level. Hence, the treatment of wastewater to prevent water scarcity is of prime importance, and failing to do so will increase ecotoxicological and human health risks. Textile wastewater contains harmful dye. Photocatalytic degradation of such dye-c…

    Submitted 13 November, 2024; originally announced November 2024.

  27. arXiv:2411.02545  [pdf, other]

    cs.CV cs.CL

    TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

    Authors: Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show tha…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Accepted at: NeurIPS 2024 | Project Page: https://tripletclip.github.io

  28. arXiv:2410.16235  [pdf, other]

    cs.CL

    ToW: Thoughts of Words Improve Reasoning in Large Language Models

    Authors: Zhikun Xu, Ming Shen, Jacob Dineen, Zhaonan Li, Xiao Ye, Shijie Lu, Aswin RRV, Chitta Baral, Ben Zhou

    Abstract: We introduce thoughts of words (ToW), a novel training-time data-augmentation method for next-word prediction. ToW views next-word prediction as a core reasoning task and injects fine-grained thoughts explaining what the next word should be and how it is related to the previous contexts in pre-training texts. Our formulation addresses two fundamental drawbacks of existing next-word prediction lear…

    Submitted 29 January, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted by NAACL 2025 Main Conference

  29. arXiv:2410.14702  [pdf, other]

    cs.AI cs.CL

    Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

    Authors: Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral

    Abstract: Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive t…

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: 49 pages, (10 pages paper, 9 pages references, 30 pages appendix)

  30. arXiv:2410.13666  [pdf, other]

    cs.CV cs.CL

    VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

    Authors: Shailaja Keyur Sampat, Mutsumi Nakamura, Shankar Kailas, Kartik Aggarwal, Mandy Zhou, Yezhou Yang, Chitta Baral

    Abstract: Deriving inference from heterogeneous inputs (such as images, text, and audio) is an important skill for humans to perform day-to-day tasks. A similar ability is desirable for the development of advanced Artificial Intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision and NLP tasks separately, they struggle to s…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 18 pages, 7 figures

  31. arXiv:2410.13662  [pdf, other]

    cs.CV

    ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

    Authors: Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral

    Abstract: Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potato…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 15 pages, 3 figures. arXiv admin note: text overlap with arXiv:2004.10796 by other authors

  32. arXiv:2410.13651  [pdf, other]

    cs.CV

    Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

    Authors: Shailaja Keyur Sampat, Maitreya Patel, Yezhou Yang, Chitta Baral

    Abstract: An ability to learn about new objects from a small amount of visual data and produce convincing linguistic justification about the presence/absence of certain concepts (that collectively compose the object) in novel scenarios is an important characteristic of human cognition. This is possible due to abstraction of attributes/properties that an object is composed of, e.g. an object 'bird' can be ide…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 14 pages, 7 figures

  33. arXiv:2408.02231  [pdf, other]

    cs.CV

    REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

    Authors: Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework which improves spatial fidelity in vision-language models.…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Accepted to ECCV 2024. Project Page : https://agneetchatterjee.com/revision/

  34. arXiv:2407.14790  [pdf, other]

    cs.CL cs.AI

    Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?

    Authors: Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, Chitta Baral

    Abstract: Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate the reasoning capability of a model which can then guide us to improve the reasoning ability of models. However, most existing works evaluate only the final predicted answer of a puzzle, without delving into an in-depth analysis of the LLMs' reasoning chains (such as where they falter) o…

    Submitted 4 October, 2024; v1 submitted 20 July, 2024; originally announced July 2024.

    Comments: Accepted at EMNLP 2024 Main

  35. arXiv:2407.03525  [pdf, ps, other]

    cs.CL

    UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization

    Authors: Md Nayem Uddin, Amir Saeidi, Divij Handa, Agastya Seth, Tran Cao Son, Eduardo Blanco, Steven R. Corman, Chitta Baral

    Abstract: This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering (TSQA) benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. We present a series of time-sensitive event scenarios based on synthetically generated facts. It requires large language models (LLMs) to engage in genuine temporal reasoning w…

    Submitted 2 June, 2025; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at ACL 2025 (Main)

  36. arXiv:2406.17169  [pdf, other]

    cs.CL cs.AI

    Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

    Authors: Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral

    Abstract: As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datas…

    Submitted 6 October, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted at EMNLP 2024 Main

  37. arXiv:2406.15444  [pdf, ps, other]

    cs.CL

    Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

    Authors: Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra

    Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experim…

    Submitted 15 September, 2025; v1 submitted 30 May, 2024; originally announced June 2024.

    Comments: Published at ICLR 2025 Workshop on Reasoning and Planning for LLMs

  38. arXiv:2406.05494  [pdf, other]

    cs.CL

    Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

    Authors: Neeraj Varshney, Satyam Raj, Venkatesh Mishra, Agneet Chatterjee, Ritika Sarkar, Amir Saeidi, Chitta Baral

    Abstract: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation pertinent to 'hallucination' in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization,…

    Submitted 8 June, 2024; originally announced June 2024.

  39. arXiv:2406.04046  [pdf, other]

    cs.CC cs.AI

    ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

    Authors: Divij Handa, Pavel Dolin, Shrinidhi Kumbhar, Tran Cao Son, Chitta Baral

    Abstract: Reasoning about Actions and Change (RAC) has historically played a pivotal role in solving foundational AI problems, such as the frame problem. It has driven advancements in AI fields, such as non-monotonic and commonsense reasoning. RAC remains crucial for AI systems that operate in dynamic environments, engage in interactive scenarios, or rely on commonsense reasoning. Despite substantial advanc…

    Submitted 2 March, 2025; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted in ICLR 2025

  40. arXiv:2406.03827  [pdf, other]

    cs.CL

    Chaos with Keywords: Exposing Large Language Models Sycophantic Hallucination to Misleading Keywords and Evaluating Defense Strategies

    Authors: Aswin RRV, Nemika Tyagi, Md Nayem Uddin, Neeraj Varshney, Chitta Baral

    Abstract: This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines,…

    Submitted 24 August, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Findings of ACL 2024

  41. arXiv:2405.16681  [pdf, other]

    cs.CL

    Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

    Authors: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

    Abstract: Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additio…

    Submitted 17 February, 2025; v1 submitted 26 May, 2024; originally announced May 2024.

  42. arXiv:2405.15961  [pdf, other]

    cs.CV

    Grounding Stylistic Domain Generalization with Quantitative Domain Shift Measures and Synthetic Scene Images

    Authors: Yiran Luo, Joshua Feinglass, Tejas Gokhale, Kuan-Cheng Lee, Chitta Baral, Yezhou Yang

    Abstract: Domain Generalization (DG) is a challenging task in machine learning that requires a coherent ability to comprehend shifts across various domains through extraction of domain-invariant features. DG performance is typically evaluated by performing image classification in domains of various image styles. However, current methodology lacks quantitative understanding about shifts in stylistic domain,…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Accepted at the 3rd CVPR Workshop on Vision Datasets Understanding

  43. arXiv:2405.09830  [pdf]

    cond-mat.mtrl-sci

    Unveiling the Direct Piezoelectric Effect on Piezo-phototronic Coupling in Ferroelectrics: First Principle Study Assisted Experimental Approach

    Authors: Koyal Suman Samantaray, Sourabh Kumar, P Maneesha, Dilip Sasmal, Suresh Chandra Baral, B. R. Vaishnavi Krupa, Arup Dasgupta, K Harrabi, A Mekki, Somaditya Sen

    Abstract: A new study explores the distinct roles of spontaneous polarization and piezoelectric polarization in piezo-phototronic coupling. This investigation focuses on differences in photocatalytic and piezo-photocatalytic performance using sodium bismuth titanate (NBT), a key ferroelectric material. The research aims to identify which type of polarization has a greater influence on piezo-phototronic effe…

    Submitted 16 May, 2024; originally announced May 2024.

  44. arXiv:2404.15522  [pdf, other]

    cs.CL cs.AI

    LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

    Authors: Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

    Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logi…

    Submitted 6 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted at ACL(Main) 2024 | First version available @ https://openreview.net/forum?id=7NR2ZVzZxx

  45. arXiv:2404.14723  [pdf, other]

    cs.CL

    Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

    Authors: Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral

    Abstract: This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks cover…

    Submitted 8 February, 2025; v1 submitted 22 April, 2024; originally announced April 2024.

  46. arXiv:2404.08540  [pdf, other]

    cs.CV

    On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

    Authors: Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness acros…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project webpage: https://agneetchatterjee.com/robustness_depth_lang/

  47. arXiv:2404.01197  [pdf, other]

    cs.CV

    Getting it Right: Improving Spatial Consistency in Text-to-Image Models

    Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

    Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find…

    Submitted 6 August, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted to ECCV 2024. Project Page : https://spright-t2i.github.io/

  48. arXiv:2403.11092  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY eess.IV

    Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

    Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

    Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and co…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: NAACL 2024 Main Conference

  49. arXiv:2402.10601  [pdf, ps, other]

    cs.CL cs.AI

    When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

    Authors: Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral

    Abstract: Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models' safety training. However, we reveal a paradoxical vulnerability: as LLMs advance in reasoning, they inadvertently become more susceptible to novel jailbreaking attacks. Enhanced reasoning en…

    Submitted 14 October, 2025; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Published in Reliable ML from Unreliable Data workshop @ NeurIPS 2025

  50. arXiv:2402.05195  [pdf, other]

    cs.CV cs.CL

    $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space

    Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang

    Abstract: Despite the recent advances in personalized text-to-image (P-T2I) generative models, it remains challenging to perform finetuning-free multi-subject-driven T2I in a resource-efficient manner. Predominantly, contemporary approaches, involving the training of Hypernetworks and Multimodal Large Language Models (MLLMs), require heavy computing resources that range from 600 to 12300 GPU hours of traini…

    Submitted 9 April, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: Project page: https://eclipse-t2i.github.io/Lambda-ECLIPSE/