FELM: Benchmarking Factuality Evaluation of Large Language Models

Authors: Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, Junxian He

Abstract: Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.

Submitted 28 November, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS 2023 Track on Datasets and Benchmarks

arXiv:2307.13528

FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

Authors: I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu

Abstract: The emergence of generative pre-trained models has facilitated the synthesis of high-quality text, but it has also posed challenges in identifying factual errors in the generated text. In particular: (1) A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. (2) Generated texts tend to be lengthy and lack a clearly defined granularity for individual facts. (3) There is a scarcity of explicit evidence available during the process of fact checking. With the above challenges in mind, in this paper, we propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models (e.g., ChatGPT). Experiments on four different tasks (knowledge-based QA, code generation, mathematical reasoning, and scientific literature review) show the efficacy of the proposed method. We release the code of FacTool associated with ChatGPT plugin interface at https://github.com/GAIR-NLP/factool.

Submitted 26 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.04507

Improving Factuality of Abstractive Summarization via Contrastive Reward Learning

Authors: I-Chun Chern, Zhiruo Wang, Sanjan Das, Bhavuk Sharma, Pengfei Liu, Graham Neubig

Abstract: Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from feedback of factuality metrics using contrastive reward learning, leading to more factual summaries by human evaluations. This suggests that further advances in learning and evaluation algorithms can feed directly into providing more factual summaries.

Submitted 10 July, 2023; originally announced July 2023.

Comments: TrustNLP @ ACL 2023

arXiv:2210.17456

Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

Authors: I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Tassadaq Hussain, Mandar Gogate, Amir Hussain, Yu Tsao, Jen-Cheng Hou

Abstract: AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.

Submitted 31 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: ICASSP AMHAT 2023

arXiv:2102.01984

doi 10.1109/ISIT45174.2021.9518018

Decoding of Quantum Data-Syndrome Codes via Belief Propagation

Authors: Kao-Yueh Kuo, I-Chun Chern, Ching-Yi Lai

Abstract: Quantum error correction is necessary to protect logical quantum states and operations. However, no meaningful data protection can be made when the syndrome extraction is erroneous due to faulty measurement gates. Quantum data-syndrome (DS) codes are designed to protect the data qubits and syndrome bits concurrently. In this paper, we propose an efficient decoding algorithm for quantum DS codes with sparse check matrices. Based on a refined belief propagation (BP) decoding for stabilizer codes, we propose a DS-BP algorithm to handle the quaternary quantum data errors and binary syndrome bit errors. Moreover, a sparse quantum code may inherently be able to handle minor syndrome errors so that fewer redundant syndrome measurements are necessary. We demonstrate this with simulations on a quantum hypergraph-product code.

Submitted 3 February, 2021; originally announced February 2021.

Journal ref: in Proc. IEEE International Symposium on Information Theory (ISIT), 2021, pp. 1552–1557

arXiv:1402.2455

doi 10.1088/0266-5611/30/5/055003

String-Averaging Expectation-Maximization for Maximum Likelihood Estimation in Emission Tomography

Authors: E. S. Helou, Y. Censor, T. -B. Chen, I-L. Chern, Á. R. De Pierro, M. Jiang, H. H. -S. Lu

Abstract: We study the maximum likelihood model in emission tomography and propose a new family of algorithms for its solution, called String-Averaging Expectation-Maximization (SAEM). In the String-Averaging algorithmic regime, the index set of all underlying equations is split into subsets, called "strings," and the algorithm separately proceeds along each string, possibly in parallel. Then, the end-points of all strings are averaged to form the next iterate. SAEM algorithms with several strings present better practical merits than the classical Row-Action Maximum-Likelihood Algorithm (RAMLA). We present numerical experiments showing the effectiveness of the algorithmic scheme in realistic situations. Performance is evaluated from the computational cost and reconstruction quality viewpoints. A complete convergence theory is also provided.

Submitted 11 February, 2014; originally announced February 2014.

Showing 1–6 of 6 results for author: Chern, I