Search | arXiv e-print repository

UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Authors: Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad

Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.

Submitted 8 July, 2025; originally announced July 2025.

Comments: Submitted to Journal of Machine Learning Research (MLOSS); UQLM Repository: https://github.com/cvs-health/uqlm

arXiv:2504.19254

Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Authors: Dylan Bouchard, Mohit Singh Chauhan

Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.

Submitted 30 April, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

Comments: UQLM repository: https://github.com/cvs-health/uqlm

arXiv:2501.03112

doi: 10.21105/joss.07570

LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases

Authors: Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad

Abstract: Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework.

Submitted 6 January, 2025; originally announced January 2025.

Comments: Journal of Open Source Software; LangFair repository: https://github.com/cvs-health/langfair

Journal ref: Journal of Open Source Software, 10(105), 7570 (2025)

arXiv:2104.08741

CEAR: Cross-Entity Aware Reranker for Knowledge Base Completion

Authors: Keshav Kolluru, Mayank Singh Chauhan, Yatin Nandwani, Parag Singla, Mausam

Abstract: Pre-trained language models (LMs) like BERT have been shown to store factual knowledge about the world. This knowledge can be used to augment the information present in Knowledge Bases, which tend to be incomplete. However, prior attempts at using BERT for the task of Knowledge Base Completion (KBC) resulted in performance worse than embedding-based techniques that rely only on the graph structure. In this work we develop a novel model, Cross-Entity Aware Reranker (CEAR), that uses BERT to re-rank the output of existing KBC models with cross-entity attention. Unlike prior work that scores each entity independently, CEAR uses BERT to score the entities together, which is effective for exploiting its factual knowledge. CEAR achieves a new state of the art for the OLPBench dataset.

Submitted 27 January, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: We found a bug in the code that invalidates the reported results for FB15k-237 and WN18RR. The results for OLPBench still hold. We are in the process of updating the paper.

arXiv:1901.06358

Embedded CNN based vehicle classification and counting in non-laned road traffic

Authors: Mayank Singh Chauhan, Arshdeep Singh, Mansi Khemka, Arneish Prateek, Rijurekha Sen

Abstract: Classifying and counting vehicles in road traffic has numerous applications in the transportation engineering domain. However, the wide variety of vehicles (two-wheelers, three-wheelers, cars, buses, trucks, etc.) plying on roads of developing regions without any lane discipline makes vehicle classification and counting a hard problem to automate. In this paper, we use state-of-the-art Convolutional Neural Network (CNN) based object detection models and train them for multiple vehicle classes using data from Delhi roads. We get up to 75% MAP on an 80-20 train-test split using 5562 video frames from four different locations. As robust network connectivity is scarce in developing regions for continuous video transmissions from the road to cloud servers, we also evaluate the latency, energy and hardware cost of embedded implementations of our CNN model based inferences.

Submitted 18 January, 2019; originally announced January 2019.

Comments: *These authors contributed equally

Showing 1–5 of 5 results for author: Chauhan, M S