-
FELM: Benchmarking Factuality Evaluation of Large Language Models
Authors:
Shiqi Chen,
Yiran Zhao,
Jinghan Zhang,
I-Chun Chern,
Siyang Gao,
Pengfei Liu,
Junxian He
Abstract:
Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
Submitted 28 November, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
-
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Authors:
I-Chun Chern,
Steffi Chern,
Shiqi Chen,
Weizhe Yuan,
Kehua Feng,
Chunting Zhou,
Junxian He,
Graham Neubig,
Pengfei Liu
Abstract:
The emergence of generative pre-trained models has facilitated the synthesis of high-quality text, but it has also posed challenges in identifying factual errors in the generated text. In particular: (1) A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. (2) Generated texts tend to be lengthy and lack a clearly defined granularity for individual facts. (3) There is a scarcity of explicit evidence available during the process of fact checking. With the above challenges in mind, in this paper, we propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models (e.g., ChatGPT). Experiments on four different tasks (knowledge-based QA, code generation, mathematical reasoning, and scientific literature review) show the efficacy of the proposed method. We release the code of FacTool associated with ChatGPT plugin interface at https://github.com/GAIR-NLP/factool .
Submitted 26 July, 2023; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Improving Factuality of Abstractive Summarization via Contrastive Reward Learning
Authors:
I-Chun Chern,
Zhiruo Wang,
Sanjan Das,
Bhavuk Sharma,
Pengfei Liu,
Graham Neubig
Abstract:
Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from feedback of factuality metrics using contrastive reward learning, leading to more factual summaries by human evaluations. This suggests that further advances in learning and evaluation algorithms can feed directly into providing more factual summaries.
Submitted 10 July, 2023;
originally announced July 2023.
-
Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings
Authors:
I-Chun Chern,
Kuo-Hsuan Hung,
Yi-Ting Chen,
Tassadaq Hussain,
Mandar Gogate,
Amir Hussain,
Yu Tsao,
Jen-Cheng Hou
Abstract:
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
Submitted 31 May, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Decoding of Quantum Data-Syndrome Codes via Belief Propagation
Authors:
Kao-Yueh Kuo,
I-Chun Chern,
Ching-Yi Lai
Abstract:
Quantum error correction is necessary to protect logical quantum states and operations. However, no meaningful data protection can be made when the syndrome extraction is erroneous due to faulty measurement gates. Quantum data-syndrome (DS) codes are designed to protect the data qubits and syndrome bits concurrently. In this paper, we propose an efficient decoding algorithm for quantum DS codes with sparse check matrices. Based on a refined belief propagation (BP) decoding for stabilizer codes, we propose a DS-BP algorithm to handle the quaternary quantum data errors and binary syndrome bit errors. Moreover, a sparse quantum code may inherently be able to handle minor syndrome errors so that fewer redundant syndrome measurements are necessary. We demonstrate this with simulations on a quantum hypergraph-product code.
Submitted 3 February, 2021;
originally announced February 2021.
-
String-Averaging Expectation-Maximization for Maximum Likelihood Estimation in Emission Tomography
Authors:
E. S. Helou,
Y. Censor,
T. -B. Chen,
I-L. Chern,
Á. R. De Pierro,
M. Jiang,
H. H. -S. Lu
Abstract:
We study the maximum likelihood model in emission tomography and propose a new family of algorithms for its solution, called String-Averaging Expectation-Maximization (SAEM). In the String-Averaging algorithmic regime, the index set of all underlying equations is split into subsets, called "strings," and the algorithm separately proceeds along each string, possibly in parallel. Then, the end-points of all strings are averaged to form the next iterate. SAEM algorithms with several strings present better practical merits than the classical Row-Action Maximum-Likelihood Algorithm (RAMLA). We present numerical experiments showing the effectiveness of the algorithmic scheme in realistic situations. Performance is evaluated from the computational cost and reconstruction quality viewpoints. A complete convergence theory is also provided.
Submitted 11 February, 2014;
originally announced February 2014.