-
Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
Authors:
Hanzhi Zhang,
Sumera Anjum,
Heng Fan,
Weijian Zheng,
Yan Huang,
Yunhe Feng
Abstract:
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introd…
▽ More
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance. The proposed Poly-FEVER is available at: https://huggingface.co/datasets/HanzhiZhang/Poly-FEVER.
△ Less
Submitted 26 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making
Authors:
Sumera Anjum,
Hanzhi Zhang,
Wenjun Zhou,
Eun Jin Paek,
Xiaopeng Zhao,
Yunhe Feng
Abstract:
Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medi…
▽ More
Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: https://github.com/ResponsibleAILab/HALO.
△ Less
Submitted 18 September, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Collecting Consistently High Quality Object Tracks with Minimal Human Involvement by Using Self-Supervised Learning to Detect Tracker Errors
Authors:
Samreen Anjum,
Suyog Jain,
Danna Gurari
Abstract:
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with little human input. The key idea is to tailor a module for each dataset to intelligently decide when an object tracker is failing and so humans should be brought in to re-localize an object for continued tracking. Our approach leverages self-supervised learning on unlab…
▽ More
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with little human input. The key idea is to tailor a module for each dataset to intelligently decide when an object tracker is failing and so humans should be brought in to re-localize an object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object that is then used to actively monitor its tracked region and decide when the tracker fails. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate our method outperforms existing approaches, especially for small, fast moving, or occluded objects.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
Authors:
Chongyan Chen,
Samreen Anjum,
Danna Gurari
Abstract:
Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predic…
▽ More
Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
△ Less
Submitted 24 August, 2023; v1 submitted 21 August, 2023;
originally announced August 2023.
-
Blockchain associated machine learning and IoT based hypoglycemia detection system with auto-injection feature
Authors:
Rahnuma Mahzabin,
Fahim Hossain Sifat,
Sadia Anjum,
Al-Akhir Nayan,
Muhammad Golam Kibria
Abstract:
Hypoglycemia is an unpleasant phenomenon caused by low blood glucose. The disease can lead a person to death or a high level of body damage. To avoid significant damage, patients need sugar. The research aims at implementing an automatic system to detect hypoglycemia and perform automatic sugar injections to save a life. Receiving the benefits of the internet of things (IoT), the sensor data was t…
▽ More
Hypoglycemia is an unpleasant phenomenon caused by low blood glucose. The disease can lead a person to death or a high level of body damage. To avoid significant damage, patients need sugar. The research aims at implementing an automatic system to detect hypoglycemia and perform automatic sugar injections to save a life. Receiving the benefits of the internet of things (IoT), the sensor data was transferred using the hypertext transfer protocol (HTTP) protocol. To ensure the safety of health-related data, blockchain technology was utilized. The glucose sensor and smartwatch data were processed via Fog and sent to the cloud. A Random Forest algorithm was proposed and utilized to decide hypoglycemic events. When the hypoglycemic event was detected, the system sent a notification to the mobile application and auto-injection device to push the condensed sugar into the victims body. XGBoost, k-nearest neighbors (KNN), support vector machine (SVM), and decision tree were implemented to compare the proposed models performance. The random forest performed 0.942 testing accuracy, better than other models in detecting hypoglycemic events. The systems performance was measured in several conditions, and satisfactory results were achieved. The system can benefit hypoglycemia patients to survive this disease.
△ Less
Submitted 26 July, 2022;
originally announced August 2022.
-
Grounding Answers for Visual Questions Asked by Visually Impaired People
Authors:
Chongyan Chen,
Samreen Anjum,
Danna Gurari
Abstract:
Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate the SOTA VQA and VQA-Groundin…
▽ More
Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate the SOTA VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence where the answer is located. These models regularly struggle when the visual evidence occupies a small fraction of the image, for images that are higher quality, as well as for visual questions that require skills in text recognition. The dataset, evaluation server, and leaderboard all can be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.
△ Less
Submitted 8 April, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
CrowdMOT: Crowdsourcing Strategies for Tracking Multiple Objects in Videos
Authors:
Samreen Anjum,
Chi Lin,
Danna Gurari
Abstract:
Crowdsourcing is a valuable approach for tracking objects in videos in a more scalable manner than possible with domain experts. However, existing frameworks do not produce high quality results with non-expert crowdworkers, especially for scenarios where objects split. To address this shortcoming, we introduce a crowdsourcing platform called CrowdMOT, and investigate two micro-task design decision…
▽ More
Crowdsourcing is a valuable approach for tracking objects in videos in a more scalable manner than possible with domain experts. However, existing frameworks do not produce high quality results with non-expert crowdworkers, especially for scenarios where objects split. To address this shortcoming, we introduce a crowdsourcing platform called CrowdMOT, and investigate two micro-task design decisions: (1) whether to decompose the task so that each worker is in charge of annotating all objects in a sub-segment of the video versus annotating a single object across the entire video, and (2) whether to show annotations from previous workers to the next individuals working on the task. We conduct experiments on a diversity of videos which show both familiar objects (aka - people) and unfamiliar objects (aka - cells). Our results highlight strategies for efficiently collecting higher quality annotations than observed when using strategies employed by today's state-of-art crowdsourcing system.
△ Less
Submitted 29 September, 2020;
originally announced September 2020.
-
Object Oriented Model for Evaluation of On-Chip Networks
Authors:
Sheraz Anjum,
Ehsan Ullah Munir,
Waqas Anwar,
Nadeem Javaid
Abstract:
The Network on Chip (NoC) paradigm is rapidly replacing bus based System on Chip (SoC) designs due to their inherent disadvantages such as non-scalability, saturation and congestion. Currently very few tools are available for the simulation and evaluation of on-chip architectures. This study proposes a generic object oriented model for performance evaluation of on-chip interconnect architectures a…
▽ More
The Network on Chip (NoC) paradigm is rapidly replacing bus based System on Chip (SoC) designs due to their inherent disadvantages such as non-scalability, saturation and congestion. Currently very few tools are available for the simulation and evaluation of on-chip architectures. This study proposes a generic object oriented model for performance evaluation of on-chip interconnect architectures and algorithms. The generic nature of the proposed model can help the researchers in evaluation of any kind of on-chip switching networks. The model was applied on 2D-Mesh and 2D-Diagonal-Mesh on-chip switching networks for verification and selection of best out of both the analyzed architectures. The results show the superiority of 2D-Diagonal-Mesh over 2D-Mesh in terms of average packet delay.
△ Less
Submitted 23 March, 2013;
originally announced April 2013.