-
Ethics by Design: A Lifecycle Framework for Trustworthy AI in Medical Imaging From Transparent Data Governance to Clinically Validated Deployment
Authors:
Umer Sadiq Khan,
Saif Ur Rehman Khan
Abstract:
The integration of artificial intelligence (AI) in medical imaging raises crucial ethical concerns at every stage of its development, from data collection to deployment. Addressing these concerns is essential for ensuring that AI systems are developed and implemented in a manner that respects patient rights and promotes fairness. This study aims to explore the ethical implications of AI in medical imaging, focusing on five key stages: data collection, data processing, model training, model evaluation, and deployment. The goal is to evaluate how these stages adhere to fundamental ethical principles, including data privacy, fairness, transparency, accountability, and autonomy. An analytical approach was employed to examine the ethical challenges associated with each stage of AI development. We reviewed existing literature, guidelines, and regulations concerning AI ethics in healthcare and identified critical ethical issues at each stage. The study outlines specific inquiries and principles for each phase of AI development. The findings highlight key ethical issues: ensuring patient consent and anonymization during data collection, addressing biases in model training, ensuring transparency and fairness during model evaluation, and the importance of continuous ethical assessments during deployment. The analysis also emphasizes the impact of accessibility issues on different stakeholders, including private, public, and third-party entities. The study concludes that ethical considerations must be systematically integrated into each stage of AI development in medical imaging. By adhering to these ethical principles, AI systems can be made more robust, transparent, and aligned with patient care and data control. We propose tailored ethical inquiries and strategies to support the creation of ethically sound AI systems in medical imaging.
Submitted 6 July, 2025;
originally announced July 2025.
-
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Authors:
Janki Atul Nawale,
Mohammed Safi Ur Rahman Khan,
Janani D,
Mansi Gupta,
Danish Pruthi,
Mitesh M. Khapra
Abstract:
Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
Submitted 29 June, 2025;
originally announced June 2025.
-
Robust & Precise Knowledge Distillation-based Novel Context-Aware Predictor for Disease Detection in Brain and Gastrointestinal
Authors:
Saif Ur Rehman Khan,
Muhammad Nabeel Asim,
Sebastian Vollmer,
Andreas Dengel
Abstract:
Medical disease prediction, particularly through imaging, remains a challenging task due to the complexity and variability of medical data, including noise, ambiguity, and differing image quality. Recent deep learning models, including Knowledge Distillation (KD) methods, have shown promising results in brain tumor image identification but still face limitations in handling uncertainty and generalizing across diverse medical conditions. Traditional KD methods often rely on a context-unaware temperature parameter to soften teacher model predictions, which does not adapt effectively to varying uncertainty levels present in medical images. To address this issue, we propose a novel framework that integrates Ant Colony Optimization (ACO) for optimal teacher-student model selection and a novel context-aware predictor approach for temperature scaling. The proposed context-aware framework adjusts the temperature based on factors such as image quality, disease complexity, and teacher model confidence, allowing for more robust knowledge transfer. Additionally, ACO efficiently selects the most appropriate teacher-student model pair from a set of pre-trained models, outperforming current optimization methods by exploring a broader solution space and better handling complex, non-linear relationships within the data. The proposed framework is evaluated using three publicly available benchmark datasets, each corresponding to a distinct medical imaging task. The results demonstrate that the proposed framework significantly outperforms current state-of-the-art methods, achieving top accuracy rates: 98.01% on the MRI brain tumor (Kaggle) dataset, 92.81% on the Figshare MRI dataset, and 96.20% on the GastroNet dataset. This enhanced performance is further evidenced by the improved results, surpassing existing benchmarks of 97.24% (Kaggle), 91.43% (Figshare), and 95.00% (GastroNet).
Submitted 9 May, 2025;
originally announced May 2025.
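The abstract above contrasts a fixed distillation temperature with one that adapts to context. As a minimal sketch only (not the authors' predictor), the snippet below shows standard temperature-scaled softmax and a hypothetical rule, `context_aware_temperature`, that raises the temperature when teacher confidence is low. The function names, the linear mapping, and the `t_min`/`t_max` defaults are all assumptions made for illustration; the paper's predictor also uses image quality and disease complexity, which are omitted here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Soften a list of teacher logits with a temperature (standard KD softening)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_temperature(teacher_confidence, t_min=1.0, t_max=5.0):
    """Hypothetical rule (an assumption, not the paper's method):
    lower teacher confidence maps linearly to a higher temperature."""
    c = min(max(teacher_confidence, 0.0), 1.0)  # clamp to [0, 1]
    return t_max - c * (t_max - t_min)


# Usage: a less confident teacher yields a flatter (softer) target distribution.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, context_aware_temperature(1.0))
soft = softmax_with_temperature(logits, context_aware_temperature(0.2))
```

Higher temperatures flatten the teacher distribution, so the student sees more of the relative ordering among non-top classes; that is the lever a context-dependent temperature adjusts per image.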
-
AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network
Authors:
Saif Ur Rehman Khan,
Muhammad Nabeel Asim,
Sebastian Vollmer,
Andreas Dengel
Abstract:
Diabetic retinopathy is a leading cause of blindness in diabetic patients, and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues related to extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology begins with the pre-processing phase, where we apply CLAHE to enhance image contrast and then gamma correction to adjust the brightness for better feature recognition. We then apply Discrete Wavelet Transform (DWT) for image fusion by combining multi-resolution details to create a richer dataset. Next, we select the three best-performing pre-trained models, DenseNet169, MobileNetV1, and Xception, for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are aggregated using a weighted ensemble approach, with the weights optimized by the Salp Swarm Algorithm (SSA). SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtains 88.52% accuracy.
Submitted 18 March, 2025;
originally announced March 2025.
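The final step described in the abstract above, aggregating base-model predictions with a weighted ensemble, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the weights are set by hand, and the Salp Swarm Algorithm search that the paper uses to choose them is not implemented here.

```python
def weighted_ensemble(prob_lists, weights):
    """Combine per-model class probabilities using normalized weights.

    prob_lists: one probability list per base model, all of equal length.
    weights: one non-negative weight per base model (hand-picked here;
             the paper optimizes them with SSA, which this sketch omits).
    """
    if len(prob_lists) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[k] for w, probs in zip(norm, prob_lists))
        for k in range(n_classes)
    ]


def predict(prob_lists, weights):
    """Return the index of the class with the highest combined probability."""
    combined = weighted_ensemble(prob_lists, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Usage: three hypothetical base models scoring two classes.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
equal_vote = predict(model_probs, [1, 1, 1])      # models 2 and 3 dominate
first_heavy = predict(model_probs, [8, 1, 1])     # model 1 dominates
```

Changing the weights can flip the predicted class, which is why the choice of weights is treated as an optimization problem rather than fixed at equal voting.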
-
Can Vision-Language Models Evaluate Handwritten Math?
Authors:
Oikantik Nath,
Hanani Bathina,
Mohammed Safi Ur Rahman Khan,
Mitesh M. Khapra
Abstract:
Recent advancements in Vision-Language Models (VLMs) have opened new possibilities in automatic grading of handwritten student responses, particularly in mathematics. However, a comprehensive study to test the ability of VLMs to evaluate and reason over handwritten content remains absent. To address this gap, we introduce FERMAT, a benchmark designed to assess the ability of VLMs to detect, localize and correct errors in handwritten mathematical content. FERMAT spans four key error dimensions - computational, conceptual, notational, and presentation - and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7-12 with intentionally introduced perturbations. Using FERMAT we benchmark nine VLMs across three tasks: error detection, localization, and correction. Our results reveal significant shortcomings in current VLMs in reasoning over handwritten text, with Gemini-1.5-Pro achieving the highest error correction rate (77%). We also observed that some models struggle with processing handwritten content, as their accuracy improves when handwritten inputs are replaced with printed text or images. These findings highlight the limitations of current VLMs and reveal new avenues for improvement. We release FERMAT and all the associated resources in the open-source to drive further research.
Submitted 12 March, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
-
Pralekha: An Indic Document Alignment Evaluation Benchmark
Authors:
Sanjay Suryanarayanan,
Haiyue Song,
Mohammed Safi Ur Rahman Khan,
Anoop Kunchukuttan,
Mitesh M. Khapra,
Raj Dabre
Abstract:
Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
Submitted 28 November, 2024;
originally announced November 2024.
-
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages
Authors:
Ashwin Sankar,
Sparsh Jain,
Nikhil Narasimhan,
Devilal Choudhary,
Dhairya Suman,
Mohammed Safi Ur Rahman Khan,
Anoop Kunchukuttan,
Mitesh M. Khapra,
Raj Dabre
Abstract:
Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that performs better than existing models. Our experiments demonstrate improvements in the translation quality, setting a new standard for Indian language speech translation. We will release all the code, data and model weights in the open-source, with permissive licenses to promote accessibility and collaboration.
Submitted 31 May, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
MILU: A Multi-task Indic Language Understanding Benchmark
Authors:
Sshubam Verma,
Mohammed Safi Ur Rahman Khan,
Vishwajeet Kumar,
Rudra Murthy,
Jaydeep Sen
Abstract:
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi-task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 41 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 74 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities and Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first-of-its-kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.
Submitted 4 February, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Authors:
Sumanth Doddapaneni,
Mohammed Safi Ur Rahman Khan,
Dilip Venkatesh,
Raj Dabre,
Anoop Kunchukuttan,
Mitesh M. Khapra
Abstract:
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
Submitted 17 October, 2024;
originally announced October 2024.
-
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Authors:
Sumanth Doddapaneni,
Mohammed Safi Ur Rahman Khan,
Sshubam Verma,
Mitesh M. Khapra
Abstract:
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI.
Submitted 26 November, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
Authors:
Mohammed Safi Ur Rahman Khan,
Priyam Mehta,
Ananth Sankar,
Umashankar Kumaravelan,
Sumanth Doddapaneni,
Suriyaprasaad B,
Varun Balan G,
Sparsh Jain,
Anoop Kunchukuttan,
Pratyush Kumar,
Raj Dabre,
Mitesh M. Khapra
Abstract:
Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
Submitted 28 November, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
Airavata: Introducing Hindi Instruction-tuned LLM
Authors:
Jay Gala,
Thanmay Jayakumar,
Jaavid Aktar Husain,
Aswanth Kumar M,
Mohammed Safi Ur Rahman Khan,
Diptesh Kanojia,
Ratish Puduppully,
Mitesh M. Khapra,
Raj Dabre,
Rudra Murthy,
Anoop Kunchukuttan
Abstract:
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additionally, we present evaluation benchmarks and a framework for assessing LLM performance across tasks in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages. You can access all artifacts at https://ai4bharat.github.io/airavata.
Submitted 26 February, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.