Showing 1–50 of 192 results for author: Joty, S

  1. arXiv:2506.09890 [pdf, ps, other]

    cs.CL cs.AI

    The Emergence of Abstract Thought in Large Language Models Beyond Any Language

    Authors: Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang

    Abstract: As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results show…

    Submitted 11 June, 2025; originally announced June 2025.

  2. arXiv:2506.06950 [pdf, ps, other]

    cs.CL

    What Makes a Good Natural Language Prompt?

    Authors: Do Xuan Long, Duy Dinh, Ngoc-Hai Nguyen, Kenji Kawaguchi, Nancy F. Chen, Shafiq Joty, Min-Yen Kan

    Abstract: As large language models (LLMs) have progressed towards more human-like and human–AI communications have become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading N…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Main Conference

  3. arXiv:2506.04723 [pdf, ps, other]

    cs.AI

    Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

    Authors: Jiayu Wang, Yifei Ming, Zixuan Ke, Caiming Xiong, Shafiq Joty, Aws Albarghouthi, Frederic Sala

    Abstract: Reinforcement learning (RL) has become the dominant paradigm for endowing language models with advanced reasoning capabilities. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of their advantages is still lacking. To address this gap, we introduce a fine-grained analytic framework to dissect the impact of RL on reasoning. Our fr…

    Submitted 5 June, 2025; originally announced June 2025.

  4. arXiv:2506.03332 [pdf, ps, other]

    cs.AI

    Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

    Authors: Yifei Ming, Zixuan Ke, Xuan-Phi Nguyen, Jiayu Wang, Shafiq Joty

    Abstract: Agentic workflows -- where multiple large language model (LLM) instances interact to solve tasks -- are increasingly built on feedback mechanisms, where one model evaluates and critiques another. Despite the promise of feedback-driven improvement, the stability of agentic workflows rests on the reliability of the judge. However, judges may hallucinate information, exhibit bias, or act adversariall…

    Submitted 3 June, 2025; originally announced June 2025.

  5. arXiv:2506.01265 [pdf, ps, other]

    cs.CL

    Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines

    Authors: Do Xuan Long, Duong Ngoc Yen, Do Xuan Trong, Luu Anh Tuan, Kenji Kawaguchi, Shafiq Joty, Min-Yen Kan, Nancy F. Chen

    Abstract: In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirical…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Findings

  6. arXiv:2505.14996 [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

    Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

    Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attemp…

    Submitted 25 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  7. arXiv:2505.13346 [pdf, ps, other]

    cs.CL cs.AI

    J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

    Authors: Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

    Abstract: To keep pace with the increasing pace of large language model (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality…

    Submitted 18 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: 25 pages, 4 figures, 6 tables. Updated with code and benchmark

  8. arXiv:2505.12265 [pdf, ps, other]

    cs.CL

    Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation

    Authors: Chengwei Qin, Wenxuan Zhou, Karthik Abinav Sankararaman, Nanshu Wang, Tengyu Xu, Alexander Radovic, Eryk Helenowski, Arya Talebzadeh, Aditya Tayade, Sinong Wang, Shafiq Joty, Han Fang, Hao Ma

    Abstract: Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically inv…

    Submitted 18 May, 2025; originally announced May 2025.

  9. arXiv:2505.08468 [pdf, ps, other]

    cs.CL cs.CV

    Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

    Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

    Abstract: Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comp…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted at ACL 2025 Industry Track

  10. arXiv:2505.07849 [pdf, ps, other]

    cs.SE cs.AI cs.IR

    SweRank: Software Issue Localization with Code Ranking

    Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty

    Abstract: Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step rea…

    Submitted 7 May, 2025; originally announced May 2025.

  11. arXiv:2504.15253 [pdf, other]

    cs.CL cs.LG

    Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

    Authors: Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty

    Abstract: Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judge empirical…

    Submitted 21 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: ICML 2025. The first two authors contributed equally. The codebase is at https://github.com/SalesforceAIResearch/jetts-benchmark

  12. arXiv:2504.09037 [pdf, other]

    cs.AI cs.CL

    A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

    Authors: Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

    Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, whi…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 72 pages, 6 figures

  13. arXiv:2504.05506 [pdf, other]

    cs.CL

    ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

    Authors: Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty

    Abstract: Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like…

    Submitted 10 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  14. arXiv:2504.03931 [pdf, other]

    cs.CL cs.AI

    NAACL2025 Tutorial: Adaptation of Large Language Models

    Authors: Zixuan Ke, Yifei Ming, Shafiq Joty

    Abstract: This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as fina…

    Submitted 21 April, 2025; v1 submitted 4 April, 2025; originally announced April 2025.

    Comments: NAACL2025 Tutorial

  15. arXiv:2503.15620 [pdf, other]

    cs.CL cs.AI cs.LG

    Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

    Authors: Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, Shafiq Joty

    Abstract: The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned to specialize in assessing and critiquing model outputs -- have been touted as general purpose evaluators, they are typically evaluated only on non-contextual s…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 23 pages, 13 figures, 6 tables

  16. arXiv:2502.20592 [pdf, other]

    cs.CL

    Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

    Authors: Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Chenyu You, Shafiq Joty, Giuseppe Carenini

    Abstract: Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi-Document Summ…

    Submitted 19 May, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

  17. arXiv:2502.11492 [pdf, other]

    cs.AI cs.CL cs.CV

    Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding

    Authors: Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu

    Abstract: Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison, which are essential for relevant complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing…

    Submitted 24 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Code and data are available at https://github.com/SalesforceAIResearch/CogAlign

  18. arXiv:2501.04961 [pdf, other]

    cs.CL cs.AI cs.CE cs.LG

    Demystifying Domain-adaptive Post-training for Financial LLMs

    Authors: Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

    Abstract: Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation in…

    Submitted 11 February, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

  19. arXiv:2412.18011 [pdf, other]

    cs.CL

    StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

    Authors: Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou

    Abstract: The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs o…

    Submitted 19 March, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

  20. arXiv:2411.16345 [pdf, other]

    cs.CL

    Preference Optimization for Reasoning with Pseudo Feedback

    Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei

    Abstract: Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasonin…

    Submitted 14 February, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: 28 pages, 11 figures. ICLR 2025

  21. arXiv:2411.15993 [pdf, other]

    cs.CL

    Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

    Authors: Lifu Tu, Rui Meng, Shafiq Joty, Yingbo Zhou, Semih Yavuz

    Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llam…

    Submitted 24 November, 2024; originally announced November 2024.

  22. arXiv:2411.12644 [pdf, other]

    cs.SE cs.AI

    CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval

    Authors: Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

    Abstract: Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need…

    Submitted 24 November, 2024; v1 submitted 19 November, 2024; originally announced November 2024.

  23. arXiv:2411.00142 [pdf, other]

    cs.CL cs.AI

    JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

    Authors: Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, Semih Yavuz

    Abstract: Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance.…

    Submitted 31 October, 2024; originally announced November 2024.

  24. arXiv:2410.23609 [pdf, other]

    cs.CL

    On Positional Bias of Faithfulness for Long-form Summarization

    Authors: David Wan, Jesse Vig, Mohit Bansal, Shafiq Joty

    Abstract: Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization…

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: 18 pages

  25. arXiv:2410.09207 [pdf, other]

    cs.AI cs.CL

    P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

    Authors: Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan

    Abstract: Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of a model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by human…

    Submitted 11 October, 2024; originally announced October 2024.

  26. arXiv:2410.07069 [pdf, other]

    cs.CL cs.AI cs.LG

    ReIFE: Re-evaluating Instruction-Following Evaluation

    Authors: Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan

    Abstract: The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently prop…

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: GitHub Repo: https://github.com/yale-nlp/ReIFE, Evaluation Result Collection: https://huggingface.co/datasets/yale-nlp/ReIFE

  27. arXiv:2410.03727 [pdf, other]

    cs.CL cs.AI cs.LG

    FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

    Authors: Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

    Abstract: Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significan…

    Submitted 24 April, 2025; v1 submitted 30 September, 2024; originally announced October 2024.

    Comments: The conference version of this paper is published at ICLR 2025

  28. arXiv:2410.01782 [pdf, other]

    cs.CL cs.AI cs.LG

    Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

    Authors: Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez

    Abstract: Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, Open-RAG, designed to enhance reasoning capabilities in RAG with open-so…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: Accepted to EMNLP 2024 Findings. Website: https://openragmoe.github.io/. 14 pages, 7 figures, 5 tables

  29. arXiv:2410.01428 [pdf, other]

    cs.CL

    Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

    Authors: Xingxuan Li, Weiwen Xu, Ruochen Zhao, Fangkai Jiao, Shafiq Joty, Lidong Bing

    Abstract: State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforw…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: Work in progress

  30. arXiv:2409.17422 [pdf, other]

    cs.CL cs.AI cs.LG

    Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

    Authors: Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layer…

    Submitted 25 September, 2024; originally announced September 2024.

  31. arXiv:2409.14664 [pdf, other]

    cs.CL

    Direct Judgement Preference Optimization

    Authors: Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty

    Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM…

    Submitted 29 September, 2024; v1 submitted 22 September, 2024; originally announced September 2024.

    Comments: Preprint

  32. arXiv:2409.09916 [pdf, other]

    cs.CL cs.AI

    SFR-RAG: Towards Contextually Faithful LLMs

    Authors: Xuan-Phi Nguyen, Shrey Pandit, Senthil Purushwalkam, Austin Xu, Hailin Chen, Yifei Ming, Zixuan Ke, Silvio Savarese, Caiming Xiong, Shafiq Joty

    Abstract: Retrieval Augmented Generation (RAG), a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance, has emerged as a pivotal area in generative AI. The LLMs used in RAG applications are required to faithfully and completely comprehend the provided context and users' questions, avoid hallucination, handle unanswerable, counte…

    Submitted 15 September, 2024; originally announced September 2024.

    Comments: Technical report

  33. arXiv:2408.08656 [pdf, other]

    cs.CL

    LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs

    Authors: Do Xuan Long, Hai Nguyen Ngoc, Tiviatis Sim, Hieu Dao, Shafiq Joty, Kenji Kawaguchi, Nancy F. Chen, Min-Yen Kan

    Abstract: We present the first systematic evaluation examining format bias in performance of large language models (LLMs). Our approach distinguishes between two categories of an evaluation metric under format constraints to reliably and accurately assess performance: one measures performance when format constraints are adhered to, while the other evaluates performance regardless of constraint adherence. We…

    Submitted 22 February, 2025; v1 submitted 16 August, 2024; originally announced August 2024.

    Comments: NAACL 2025 Main Conference

  34. arXiv:2408.05346 [pdf, other]

    cs.CL

    DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts

    Authors: Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty

    Abstract: Data-driven storytelling is a powerful method for conveying insights by combining narrative techniques with visualizations and text. These stories integrate visual aids, such as highlighted bars and lines in charts, along with textual annotations explaining insights. However, creating such stories requires a deep understanding of the data and meticulous narrative planning, often necessitating huma…

    Submitted 3 October, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

  35. arXiv:2407.21794 [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

    Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Yueqian Lin, Qing Yu, Go Irie, Shafiq Joty, Yixuan Li, Hai Li, Ziwei Liu, Toshihiko Yamasaki, Kiyoharu Aizawa

    Abstract: Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework w…

    Submitted 18 June, 2025; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted at TMLR2025. Survey paper. We welcome questions, issues, and paper requests via https://github.com/AtsuMiyai/Awesome-OOD-VLM

  36. arXiv:2407.04172 [pdf, other]

    cs.AI cs.CL cs.CV

    ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

    Authors: Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, Shafiq Joty

    Abstract: Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart…

    Submitted 3 November, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  37. arXiv:2407.04069 [pdf, other]

    cs.CL cs.AI cs.LG

    A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

    Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

    Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the comple…

    Submitted 3 October, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted at EMNLP 2024 (Main Conference)

  38. arXiv:2406.03776 [pdf, other]

    cs.CL cs.AI cs.CV cs.IR

    XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags

    Authors: Faisal Tareque Shohan, Mir Tafseer Nayeem, Samsul Islam, Abu Ubaida Akash, Shafiq Joty

    Abstract: Millions of news articles published online daily can overwhelm readers. Headlines and entity (topic) tags are essential for guiding readers to decide if the content is worth their time. While headline generation has been extensively studied, tag generation remains largely unexplored, yet it offers readers better access to topics of interest. The need for conciseness in capturing readers' attention…

    Submitted 7 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: ACL 2024 camera ready. The first two authors contributed equally

  39. arXiv:2405.15329 [pdf, other]

    cs.CL

    DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation

    Authors: Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy F. Chen, Min-Yen Kan

    Abstract: The acceleration of Large Language Models (LLMs) research has opened up new possibilities for evaluating generated texts. They serve as scalable and economical evaluators, but the question of how reliable these evaluators are has emerged as a crucial research question. Prior research efforts in the meta-evaluation of LLMs as judges limit the prompting of an LLM to a single use to obtain a final ev…

    Submitted 8 December, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: COLING2025

  40. arXiv:2404.16251 [pdf, other]

    cs.CR cs.AI cs.CL

    Prompt Leakage effect and defense strategies for multi-turn LLM interactions

    Authors: Divyansh Agarwal, Alexander R. Fabbri, Ben Risher, Philippe Laban, Shafiq Joty, Chien-Sheng Wu

    Abstract: Prompt leakage poses a compelling security and privacy threat in LLM applications. Leakage of system prompts may compromise intellectual property, and act as adversarial reconnaissance for an attacker. A systematic evaluation of prompt leakage threats and mitigation strategies is lacking, especially for multi-turn LLM interactions. In this paper, we systematically investigate LLM vulnerabilities a…

    Submitted 29 July, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  41. arXiv:2404.12728  [pdf, ps, other]

    cs.CL

    Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

    Authors: Chengwei Qin, Wenhan Xia, Tan Wang, Fangkai Jiao, Yuchen Hu, Bosheng Ding, Ruirui Chen, Shafiq Joty

    Abstract: Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks. Coincidentally, the NLP community has also recently found that self-generating relevant examples in the context…

    Submitted 1 June, 2025; v1 submitted 19 April, 2024; originally announced April 2024.

  42. arXiv:2404.02507  [pdf, other]

    cs.CL

    Lifelong Event Detection with Embedding Space Separation and Compaction

    Authors: Chengwei Qin, Ruirui Chen, Ruochen Zhao, Wenhan Xia, Shafiq Joty

    Abstract: To mitigate forgetting, existing lifelong event detection methods typically maintain a memory module and replay the stored memory data during the learning of a new task. However, the simple combination of memory data and new-task samples can still result in substantial forgetting of previously acquired knowledge, which may occur due to the potential overlap between the feature distribution of new…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: NAACL 2024 main conference

  43. arXiv:2404.00699  [pdf, ps, other]

    cs.CL

    A Comprehensive Survey of Contamination Detection Methods in Large Language Models

    Authors: Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

    Abstract: With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial Intelligence (AI) have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of…

    Submitted 2 April, 2025; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: 14 pages, 1 figure, 2 tables

  44. arXiv:2404.00570  [pdf, other]

    cs.CL

    ParaICL: Towards Parallel In-Context Learning

    Authors: Xingxuan Li, Xuan-Phi Nguyen, Shafiq Joty, Lidong Bing

    Abstract: Large language models (LLMs) have become the norm in natural language processing (NLP), excelling in few-shot in-context learning (ICL) with their remarkable abilities. Nonetheless, the success of ICL largely hinges on the choice of few-shot demonstration examples, making the selection process increasingly crucial. Existing methods have delved into optimizing the quantity and semantic similarity o…

    Submitted 5 May, 2025; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: Accepted by NAACL 2025

  45. arXiv:2403.12027  [pdf, other]

    cs.CL cs.AI cs.CV

    From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

    Authors: Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji

    Abstract: Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increa…

    Submitted 4 December, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: IEEE Transactions on Knowledge and Data Engineering (TKDE)

  46. arXiv:2403.09028  [pdf, other]

    cs.CL

    ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

    Authors: Ahmed Masry, Mehrad Shahmohammadi, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty

    Abstract: Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question-answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision-language tasks. However, such task-specific mo…

    Submitted 13 March, 2024; originally announced March 2024.

  47. arXiv:2403.02990  [pdf, other]

    cs.CL cs.AI

    Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

    Authors: Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, Shafiq Joty

    Abstract: In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural…

    Submitted 2 July, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  48. arXiv:2402.00658  [pdf, other]

    cs.AI cs.CL

    Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

    Authors: Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty

    Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding the hallucination and flaws in their reasoning process. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning a…

    Submitted 15 October, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: 17 pages, 9 figures. EMNLP 2024

  49. arXiv:2401.13974  [pdf, other]

    cs.CV cs.AI cs.GR

    BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

    Authors: Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik

    Abstract: Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-…

    Submitted 25 January, 2024; originally announced January 2024.

  50. arXiv:2312.17055  [pdf, ps, other]

    cs.CL

    Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning

    Authors: Chengwei Qin, Wenhan Xia, Fangkai Jiao, Chen Chen, Yuchen Hu, Bosheng Ding, Ruirui Chen, Shafiq Joty

    Abstract: Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). Despite their success in showing such emergent abilities, the scale and complexity of larger models also lead to unprecedentedly high computational demands and deployment challenges. In reaction, researchers explore transferring the powerful capabilities of larger models to more…

    Submitted 1 June, 2025; v1 submitted 28 December, 2023; originally announced December 2023.