
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Authors: Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, Jimmy Lin

Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2410.13210 [pdf, other]

FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Authors: Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad

Abstract: Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models, both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. "Challenging" here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models have near 50% accuracy on FaithBench, indicating lots of room for future improvement. The repo is https://github.com/vectara/FaithBench

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2301.06622 [pdf, other]

IOPathTune: Adaptive Online Parameter Tuning for Parallel File System I/O Path

Authors: Md. Hasanur Rashid, Youbiao He, Forrest Sheng Bao, Dong Dai

Abstract: Parallel file systems contain complicated I/O paths from clients to storage servers. An efficient I/O path requires proper settings of multiple parameters, as the default settings often fail to deliver optimal performance, especially for diverse workloads in the HPC environment. Existing tuning strategies have shortcomings in being adaptive, timely, and flexible. We propose IOPathTune, which adaptively tunes PFS I/O Path online from the client side without characterizing the workloads, doing expensive profiling, or communicating with other machines. We implemented IOPathTune on Lustre and leveraged CloudLab to conduct the evaluations on 20 different Filebench workloads in three different scenarios. We observed either on-par or better performance than the default configuration, as high as 231% on standalone executions. IOPathTune also delivers 89.57% better overall performance than CAPES in multiple client executions.

Submitted 16 January, 2023; originally announced January 2023.

arXiv:2301.02410 [pdf, other]

Codepod: A Namespace-Aware, Hierarchical Jupyter for Interactive Development at Scale

Authors: Hebi Li, Forrest Sheng Bao, Qi Xiao, Jin Tian

Abstract: Jupyter is a browser-based interactive development environment that has been popular recently. Jupyter models programs in code blocks, and makes it easy to develop code blocks interactively by running the code blocks and attaching rich media output. However, Jupyter provides no support for module systems and namespaces. Code blocks are linear and live in the global namespace; therefore, it is hard to develop large projects that require modularization in Jupyter. As a result, large-code projects are still developed in traditional text files, and Jupyter is only used as a surface presentation. We present Codepod, a namespace-aware Jupyter that is suitable for interactive development at scale. Instead of linear code blocks, Codepod models code blocks as hierarchical code pods, and provides a simple yet powerful module system for namespace-aware incremental evaluation. Codepod is open source at https://github.com/codepod-io/codepod.

Submitted 6 January, 2023; originally announced January 2023.

arXiv:2212.10013 [pdf, other]

DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely

Authors: Forrest Sheng Bao, Ruixuan Tu, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, Cen Chen

Abstract: Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. After being repurposed reference-freely, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.

Submitted 26 November, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: Accepted into Findings of EMNLP 2023
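
Illustration: the repurposing idea in the DocAsRef abstract can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: `overlap_f1` is a hypothetical unigram-overlap metric standing in for a reference-based metric such as BERTScore, and all function names are assumptions made for the example. The only difference between the two modes is which text is passed as the comparison target.

```python
from collections import Counter


def overlap_f1(candidate: str, target: str) -> float:
    """Toy reference-based metric: unigram-overlap F1 (hypothetical stand-in for BERTScore)."""
    cand = Counter(candidate.lower().split())
    tgt = Counter(target.lower().split())
    overlap = sum((cand & tgt).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(tgt.values())
    return 2 * precision * recall / (precision + recall)


def score_reference_based(summary: str, reference: str) -> float:
    # Conventional use: compare the system summary with a human-written reference.
    return overlap_f1(summary, reference)


def score_reference_free(summary: str, document: str) -> float:
    # Repurposed use: the source document takes the place of the reference.
    return overlap_f1(summary, document)
```

Because the metric itself is unchanged, any reference-based scorer with the same two-argument shape could in principle be swapped in for `overlap_f1`.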

arXiv:2006.13607 [pdf, other]

Circuit Routing Using Monte Carlo Tree Search and Deep Neural Networks

Authors: Youbiao He, Forrest Sheng Bao

Abstract: Circuit routing is a fundamental problem in designing electronic systems such as integrated circuits (ICs) and printed circuit boards (PCBs) which form the hardware of electronics and computers. Like finding paths between pairs of locations, circuit routing generates traces of wires to connect contacts or leads of circuit components. It is challenging because finding paths between dense and massive electronic components involves a very large search space. Existing solutions are either manually designed with domain knowledge or tailored to specific design rules, hence, difficult to adapt to new problems or design needs. Therefore, a general routing approach is highly desired. In this paper, we model the circuit routing as a sequential decision-making problem, and solve it by Monte Carlo tree search (MCTS) with deep neural network (DNN) guided rollout. It could be easily extended to routing cases with more routing constraints and optimization goals. Experiments on randomly generated single-layer circuits show the potential to route complex circuits. The proposed approach can solve the problems that benchmark methods such as sequential A* method and Lee's algorithm cannot solve, and can also outperform the vanilla MCTS approach.

Submitted 24 June, 2020; originally announced June 2020.

ACM Class: F.2.2; I.2.8

arXiv:2005.06546 [pdf]

Triaging moderate COVID-19 and other viral pneumonias from routine blood tests

Authors: Forrest Sheng Bao, Youbiao He, Jie Liu, Yuanfang Chen, Qian Li, Christina R. Zhang, Lei Han, Baoli Zhu, Yaorong Ge, Shi Chen, Ming Xu, Liu Ouyang

Abstract: COVID-19 is sweeping the world with deadly consequences. Its contagious nature and clinical similarity to other pneumonias make separating subjects who have contracted COVID-19 from those with non-COVID-19 viral pneumonia a priority and a challenge. However, COVID-19 testing has been greatly limited by the availability and cost of existing methods, even in developed countries like the US. Intrigued by the wide availability of routine blood tests, we propose to leverage them for COVID-19 testing using the power of machine learning. Two proven-robust machine learning model families, random forests (RFs) and support vector machines (SVMs), are employed to tackle the challenge. Trained on blood data from 208 moderate COVID-19 subjects and 86 subjects with non-COVID-19 moderate viral pneumonia, the best result is obtained in an SVM-based classifier with an accuracy of 84%, a sensitivity of 88%, a specificity of 80%, and a precision of 92%. The results are found explainable from both machine learning and medical perspectives. A privacy-protected web portal is set up to help medical personnel in their practice and the trained models are released for developers to further build other applications. We hope our results can help the world fight this pandemic and welcome clinical verification of our approach on larger populations.

Submitted 13 May, 2020; originally announced May 2020.

ACM Class: I.5.4

arXiv:2005.06377 [pdf, other]

SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling

Authors: Forrest Sheng Bao, Hebi Li, Ge Luo, Minghui Qiu, Yinfei Yang, Youbiao He, Cen Chen

Abstract: Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which cannot capture semantics or linguistic quality well, and require a reference summary, which is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of the two drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a great advantage in gauging linguistic qualities over all metrics.

Submitted 5 May, 2022; v1 submitted 13 May, 2020; originally announced May 2020.

Comments: accepted into NAACL 2022

ACM Class: I.2.7
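
Illustration: the data transformation mentioned in the SueNes abstract (pairing documents with corrupted reference summaries) can be sketched as below. This is a hedged sketch under stated assumptions, not the authors' pipeline: random word deletion as the corruption, the label `1 - delete_ratio` as a quality proxy, and the names `corrupt_by_deletion` and `make_training_samples` are all assumptions made for the example.

```python
import random


def corrupt_by_deletion(summary: str, delete_ratio: float, rng: random.Random) -> str:
    """Drop roughly `delete_ratio` of the words (one assumed way to corrupt a summary)."""
    words = summary.split()
    n_delete = round(len(words) * delete_ratio)
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


def make_training_samples(pairs, ratios=(0.0, 0.5, 1.0), seed=0):
    """Turn (document, reference) pairs into (document, summary, label) triples.

    The label is 1 - delete_ratio, an assumed proxy for summary quality.
    """
    rng = random.Random(seed)
    samples = []
    for document, reference in pairs:
        for ratio in ratios:
            corrupted = corrupt_by_deletion(reference, ratio, rng)
            samples.append((document, corrupted, 1.0 - ratio))
    return samples
```

Triples of this shape could then be used to train a model that scores a summary against its document alone, with no reference summary needed at evaluation time.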

arXiv:1910.08925 [pdf, other]

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Authors: Di Zhang, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie

Abstract: Today high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority functions can hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous 'trial and error'. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.

Submitted 1 September, 2020; v1 submitted 20 October, 2019; originally announced October 2019.

Comments: 14 pages; conference accepted

arXiv:1702.07998 [pdf, ps, other]

Detecting (Un)Important Content for Single-Document News Summarization

Authors: Yinfei Yang, Forrest Sheng Bao, Ani Nenkova

Abstract: We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the "beginning of document" heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because in the absence of cross-document repetition, single document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline.

Submitted 26 February, 2017; originally announced February 2017.

Comments: Accepted By EACL 2017

arXiv:1009.5268 [pdf, ps, other]

General Scaled Support Vector Machines

Authors: Xin Liu, Ying Ding, Forrest Sheng Bao

Abstract: Support Vector Machines (SVMs) are popular tools for data mining tasks such as classification, regression, and density estimation. However, the original SVM (C-SVM) only considers local information of data points on or over the margin. Therefore, C-SVM loses robustness. To solve this problem, one approach is to translate (i.e., to move without rotation or change of shape) the hyperplane according to the distribution of the entire data. But existing work can only be applied to the 1-D case. In this paper, we propose a simple and efficient method called General Scaled SVM (GS-SVM) to extend the existing approach to the multi-dimensional case. Our method translates the hyperplane according to the distribution of data projected on the normal vector of the hyperplane. Compared with C-SVM, GS-SVM has better performance on several data sets.

Submitted 27 September, 2010; originally announced September 2010.

Comments: 5 pages, 4 figures

ACM Class: I.5.2

arXiv:0906.0205 [pdf, ps, other]

A Survey of Tree Convex Sets Test

Authors: Yuanlin Zhang, Forrest Sheng Bao

Abstract: Tree convex sets refer to a collection of sets such that each set in the collection is a subtree of a tree whose nodes are the elements of these sets. They extend the concept of row convex sets each of which is an interval over a total ordering of the elements of those sets. They have been applied to identify tractable Constraint Satisfaction Problems and Combinatorial Auction Problems. Recently, polynomial algorithms have been proposed to recognize tree convex sets. In this paper, we review the materials that are the key to a linear recognition algorithm.

Submitted 31 May, 2009; originally announced June 2009.

Comments: 13 pages, 5 figures, 2 tables

ACM Class: F.2

arXiv:0904.3808 [pdf, ps, other]

Automated Epilepsy Diagnosis Using Interictal Scalp EEG

Authors: Forrest Sheng Bao, Jue-Ming Gao, Jing Hu, Donald Y. -C. Lie, Yuanlin Zhang, K. J. Oommen

Abstract: Over 50 million people worldwide suffer from epilepsy. Traditional diagnosis of epilepsy relies on tedious visual screening by highly trained clinicians of lengthy EEG recordings that contain seizure (ictal) activities. Nowadays, there are many automatic systems that can recognize seizure-related EEG signals to help the diagnosis. However, it is very costly and inconvenient to obtain long-term EEG data with seizure activities, especially in areas short of medical resources. We demonstrate in this paper that we can use the interictal scalp EEG data, which is much easier to collect than the ictal data, to automatically diagnose whether a person is epileptic. In our automated EEG recognition system, we extract three classes of features from the EEG data and build Probabilistic Neural Networks (PNNs) fed with these features. We optimize the feature extraction parameters and combine these PNNs through a voting mechanism. As a result, our system achieves an impressive 94.07% accuracy, which is very close to reported human recognition accuracy by experienced medical professionals.

Submitted 24 April, 2009; v1 submitted 24 April, 2009; originally announced April 2009.

Comments: 5 pages, 4 figures, 3 tables, based on our IEEE ICTAI'08 paper, submitted to IEEE EMBC'09

ACM Class: I.5.4; I.2.1

arXiv:0804.3361 [pdf, ps, other]

doi 10.1109/ICTAI.2008.99

A New Approach to Automated Epileptic Diagnosis Using EEG and Probabilistic Neural Network

Authors: Forrest Sheng Bao, Donald Yu-Chun Lie, Yuanlin Zhang

Abstract: Epilepsy is one of the most common neurological disorders that greatly impair patients' daily lives. Traditional epileptic diagnosis relies on tedious visual screening by neurologists of lengthy EEG recordings, which requires the presence of seizure (ictal) activities. Nowadays, there are many systems helping neurologists to quickly find interesting segments of the lengthy signal by automatic seizure detection. However, we notice that it is very difficult, if not impossible, to obtain long-term EEG data with seizure activities for epilepsy patients in areas lacking medical resources and trained neurologists. Therefore, we propose to study automated epileptic diagnosis using interictal EEG data, which is much easier to collect than ictal data. The authors are not aware of any report on an automated EEG diagnostic system that can accurately distinguish patients' interictal EEG from the EEG of normal people. The research presented in this paper, therefore, aims to develop an automated diagnostic system that can use interictal EEG data to diagnose whether the person is epileptic. Such a system should also detect seizure activities for further investigation by doctors and potential patient monitoring. To develop such a system, we extract four classes of features from the EEG data and build a Probabilistic Neural Network (PNN) fed with these features. Leave-one-out cross-validation (LOO-CV) on a widely used epileptic-normal data set reflects an impressive 99.5% accuracy of our system on distinguishing normal people's EEG from patients' interictal EEG. We also find our system can be used in patient monitoring (seizure detection) and seizure focus localization, with 96.7% and 77.5% accuracy respectively on the data set.

Submitted 4 July, 2008; v1 submitted 21 April, 2008; originally announced April 2008.

Comments: 5 pages, 6 figures, 1 table, submitted to IEEE ICTAI 2008

ACM Class: I.5.4; I.2.1

arXiv:0707.4289 [pdf, ps, other]

A Leaf Recognition Algorithm for Plant Classification Using Probabilistic Neural Network

Authors: Stephen Gang Wu, Forrest Sheng Bao, Eric You Xu, Yu-Xuan Wang, Yi-Fan Chang, Qiao-Liang Xiang

Abstract: In this paper, we employ a Probabilistic Neural Network (PNN) with image and data processing techniques to implement a general purpose automated leaf recognition algorithm. 12 leaf features are extracted and orthogonalized into 5 principal variables which constitute the input vector of the PNN. The PNN is trained on 1800 leaves to classify 32 kinds of plants with an accuracy greater than 90%. Compared with other approaches, our algorithm is an accurate artificial intelligence approach which is fast in execution and easy in implementation.

Submitted 29 July, 2007; originally announced July 2007.

Comments: 6 pages, 3 figures, 2 tables

ACM Class: I.5.4
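
Illustration: the Probabilistic Neural Network used in this and the two preceding entries can be sketched in its textbook form, a Gaussian Parzen-window density estimate per class followed by an argmax. This is a minimal sketch, not the authors' code: the function name `pnn_classify`, the smoothing parameter `sigma`, equal class priors, and the toy 2-D inputs in the usage note are assumptions, and the feature extraction and PCA steps from the abstract are not reproduced.

```python
import math
from collections import defaultdict


def pnn_classify(x, train_X, train_y, sigma=1.0):
    """Classify x with a basic PNN: a Gaussian Parzen-window density per class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] += math.exp(-sq_dist / (2 * sigma ** 2))  # pattern-layer activation
        counts[yi] += 1
    # Summation layer: average activation per class; output layer: pick the largest.
    return max(sums, key=lambda c: sums[c] / counts[c])
```

For example, with training points `[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]` labelled `["a", "a", "b", "b"]`, a query near the origin is assigned to class `"a"`. A PNN needs no iterative training; the stored examples are the model, which is one reason it is described as easy to implement.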

arXiv:0706.0585 [pdf, ps, other]

doi 10.1109/ICTAI.2007.99

A Novel Model of Working Set Selection for SMO Decomposition Methods

Authors: Zhendong Zhao, Lei Yuan, Yuxuan Wang, Forrest Sheng Bao, Shunyi Zhang, Yanfei Sun

Abstract: In the process of training Support Vector Machines (SVMs) by decomposition methods, working set selection is an important technique, and some exciting schemes have been applied in this field. To improve working set selection, we propose a new model for working set selection in sequential minimal optimization (SMO) decomposition methods. This model selects B as the working set without reselection. Some properties are given by simple proof, and experiments demonstrate that the proposed method is in general faster than existing methods.

Submitted 5 June, 2007; originally announced June 2007.

Comments: 8 pages, 12 figures, it was submitted to IEEE International conference of Tools on Artificial Intelligence

Showing 1–16 of 16 results for author: Bao, F S