Search | arXiv e-print repository

Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Authors: Fadel M. Megahed, Ying-Ju Chen, L. Allison Jones-Farmer, Younghwa Lee, Jiawei Brooke Wang, Inez M. Zwetsloot

Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.

Submitted 20 May, 2025; originally announced May 2025.

Comments: 25 pages

arXiv:2501.12596

Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples

Authors: Fadel M. Megahed, Ying-Ju Chen, Bianca Maria Colosimo, Marco Luigi Giuseppe Grasso, L. Allison Jones-Farmer, Sven Knoth, Hongyue Sun, Inez Zwetsloot

Abstract: This expository paper introduces a simplified approach to image-based quality inspection in manufacturing using OpenAI's CLIP (Contrastive Language-Image Pretraining) model adapted for few-shot learning. While CLIP has demonstrated impressive capabilities in general computer vision tasks, its direct application to manufacturing inspection presents challenges due to the domain gap between its training data and industrial applications. We evaluate CLIP's effectiveness through five case studies: metallic pan surface inspection, 3D printing extrusion profile analysis, stochastic textured surface evaluation, automotive assembly inspection, and microstructure image classification. Our results show that CLIP can achieve high classification accuracy with relatively small learning sets (50-100 examples per class) for single-component and texture-based applications. However, the performance degrades with complex multi-component scenes. We provide a practical implementation framework that enables quality engineers to quickly assess CLIP's suitability for their specific applications before pursuing more complex solutions. This work establishes CLIP-based few-shot learning as an effective baseline approach that balances implementation simplicity with robust performance, demonstrated in several manufacturing quality control applications.

Submitted 21 January, 2025; originally announced January 2025.

Comments: 31 pages, 13 figures

arXiv:2407.15010

ChatISA: A Prompt-Engineered, In-House Multi-Modal Generative AI Chatbot for Information Systems Education

Authors: Fadel M. Megahed, Ying-Ju Chen, Joshua A. Ferris, Cameron Resatar, Kaitlyn Ross, Younghwa Lee, L. Allison Jones-Farmer

Abstract: As generative AI ('GenAI') continues to evolve, educators face the challenge of preparing students for a future where AI-assisted work is integral to professional success. This paper introduces ChatISA, an in-house, multi-model AI chatbot designed to support students and faculty in an Information Systems and Analytics (ISA) department. ChatISA comprises four primary modules: Coding Companion, Project Coach, Exam Ally, and Interview Mentor, each tailored to enhance different aspects of the educational experience. Through iterative development, student feedback, and leveraging open-source frameworks, we created a robust tool that addresses coding inquiries, project management, exam preparation, and interview readiness. The implementation of ChatISA provided valuable insights and highlighted key challenges. Our findings demonstrate the benefits of ChatISA for ISA education while underscoring the need for adaptive pedagogy and proactive engagement with AI tools to fully harness their educational potential. To support broader adoption and innovation, all code for ChatISA is made publicly available on GitHub, enabling other institutions to customize and integrate similar AI-driven educational tools within their curricula.

Submitted 16 May, 2025; v1 submitted 13 June, 2024; originally announced July 2024.

Comments: 22 pages

arXiv:2308.13550

doi 10.1080/00224065.2024.2372328

Introducing ChatSQC: Enhancing Statistical Quality Control with Augmented AI

Authors: Fadel M. Megahed, Ying-Ju Chen, Inez Zwetsloot, Sven Knoth, Douglas C. Montgomery, L. Allison Jones-Farmer

Abstract: We introduce ChatSQC, an innovative chatbot system that combines the power of OpenAI's Large Language Models (LLM) with a specific knowledge base in Statistical Quality Control (SQC). Our research focuses on enhancing LLMs using specific SQC references, shedding light on how data preprocessing parameters and LLM selection impact the quality of generated responses. By illustrating this process, we hope to motivate wider community engagement to refine LLM design and output appraisal techniques. We also highlight potential research opportunities within the SQC domain that can be facilitated by leveraging ChatSQC, thereby broadening the application spectrum of SQC. A primary goal of our work is to provide a template and proof-of-concept on how LLMs can be utilized by our community. To continuously improve ChatSQC, we ask the SQC community to provide feedback, highlight potential issues, request additional features, and/or contribute via pull requests through our public GitHub repository. Additionally, the team will continue to explore adding supplementary reference material that would further improve the contextual understanding of the chatbot. Overall, ChatSQC serves as a testament to the transformative potential of AI within SQC, and we hope it will spur further advancements in the integration of AI in this field.

Submitted 28 March, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: 24 pages

Journal ref: Journal of Quality Technology, (2024), 1-24

arXiv:2302.10916

doi 10.1080/08982112.2023.2206479

How Generative AI models such as ChatGPT can be (Mis)Used in SPC Practice, Education, and Research? An Exploratory Study

Authors: Fadel M. Megahed, Ying-Ju Chen, Joshua A. Ferris, Sven Knoth, L. Allison Jones-Farmer

Abstract: Generative Artificial Intelligence (AI) models such as OpenAI's ChatGPT have the potential to revolutionize Statistical Process Control (SPC) practice, learning, and research. However, these tools are in the early stages of development and can be easily misused or misunderstood. In this paper, we give an overview of the development of Generative AI. Specifically, we explore ChatGPT's ability to provide code, explain basic concepts, and create knowledge related to SPC practice, learning, and research. By investigating responses to structured prompts, we highlight the benefits and limitations of the results. Our study indicates that the current version of ChatGPT performs well for structured tasks, such as translating code from one language to another and explaining well-known concepts but struggles with more nuanced tasks, such as explaining less widely known terms and creating code from scratch. We find that using new AI tools may help practitioners, educators, and researchers to be more efficient and productive. However, in their current stages of development, some results are misleading and wrong. Overall, the use of generative AI models in SPC must be properly validated and used in conjunction with other methods to ensure accurate results.

Submitted 17 February, 2023; originally announced February 2023.

Comments: 30 pages, 20 figures

MSC Class: 62P30

ACM Class: G.3; G.4; J.2; J.6

arXiv:1706.06368

doi 10.1145/3132847.3132938

FA*IR: A Fair Top-k Ranking Algorithm

Authors: Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, Ricardo Baeza-Yates

Abstract: In this work, we define and solve the Fair Top-k Ranking problem, in which we want to determine a subset of k candidates from a large pool of n >> k candidates, maximizing utility (i.e., select the "best" candidates) subject to group fairness criteria. Our ranked group fairness definition extends group fairness using the standard notion of protected groups and is based on ensuring that the proportion of protected candidates in every prefix of the top-k ranking remains statistically above or indistinguishable from a given minimum. Utility is operationalized in two ways: (i) every candidate included in the top-k should be more qualified than every candidate not included; and (ii) for every pair of candidates in the top-k, the more qualified candidate should be ranked above. An efficient algorithm is presented for producing the Fair Top-k Ranking, and tested experimentally on existing datasets as well as new datasets released with this paper, showing that our approach yields small distortions with respect to rankings that maximize utility without considering fairness criteria. To the best of our knowledge, this is the first algorithm grounded in statistical tests that can mitigate biases in the representation of an under-represented group along a ranked list.

Submitted 2 July, 2018; v1 submitted 20 June, 2017; originally announced June 2017.

Comments: In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM'17). This version corrects an error on Table 4

ACM Class: H.3.3; J.1

Showing 1–6 of 6 results for author: Megahed, M