Search | arXiv e-print repository

Gene42: Long-Range Genomic Foundation Model With Dense Attention

Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, and Shadab Khan

Abstract: We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context… ▽ More We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai. △ Less

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2501.09825 [pdf, ps, other]

Bridging Language Barriers in Healthcare: A Study on Arabic LLMs

Authors: Nada Saadi, Tathagata Raha, Clément Christophe, Marco AF Pimentel, Ronnie Rajan, Praveen K Kanithi

Abstract: This paper investigates the challenges of developing large language models (LLMs) proficient in both multilingual understanding and medical knowledge. We demonstrate that simply translating medical data does not guarantee strong performance on clinical tasks in the target language. Our experiments reveal that the optimal language mix in training data varies significantly across different medical t… ▽ More This paper investigates the challenges of developing large language models (LLMs) proficient in both multilingual understanding and medical knowledge. We demonstrate that simply translating medical data does not guarantee strong performance on clinical tasks in the target language. Our experiments reveal that the optimal language mix in training data varies significantly across different medical tasks. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks. Furthermore, our results suggest that relying solely on fine-tuning may not be the most effective approach for incorporating new language knowledge into LLMs. Instead, data and computationally intensive pretraining methods may still be necessary to achieve optimal performance in multilingual medical settings. These findings provide valuable guidance for building effective and inclusive medical AI systems for diverse linguistic communities. △ Less

Submitted 16 January, 2025; originally announced January 2025.

arXiv:2410.05046 [pdf, other]

Named Clinical Entity Recognition Benchmark

Authors: Wadood M Abdul, Marco AF Pimentel, Muhammad Umar Salman, Tathagata Raha, Clément Christophe, Praveen K Kanithi, Nasir Hayat, Ronnie Rajan, Shadab Khan

Abstract: This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support. The leaderboard provides a standa… ▽ More This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support. The leaderboard provides a standardized platform for assessing diverse language models, including encoder and decoder architectures, on their ability to identify and classify clinical entities across multiple medical domains. A curated collection of openly available clinical datasets is utilized, encompassing entities such as diseases, symptoms, medications, procedures, and laboratory measurements. Importantly, these entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across different healthcare systems and datasets, and a comprehensive evaluation of model performance. Performance of models is primarily assessed using the F1-score, and it is complemented by various assessment modes to provide comprehensive insights into model performance. The report also includes a brief analysis of models evaluated to date, highlighting observed trends and limitations. By establishing this benchmarking framework, the leaderboard aims to promote transparency, facilitate comparative analyses, and drive innovation in clinical entity recognition tasks, addressing the need for robust evaluation methods in healthcare NLP. △ Less

Submitted 7 October, 2024; originally announced October 2024.

Comments: Technical Report

arXiv:2409.14988 [pdf, other]

Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

Authors: Clément Christophe, Tathagata Raha, Svetlana Maslenkova, Muhammad Umar Salman, Praveen K Kanithi, Marco AF Pimentel, Shadab Khan

Abstract: Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining d… ▽ More Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain. △ Less

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.07314 [pdf, other]

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Authors: Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

Abstract: The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconn… ▽ More The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications. △ Less

Submitted 11 September, 2024; originally announced September 2024.

Comments: Technical report

arXiv:2408.06142 [pdf, ps, other]

Med42-v2: A Suite of Clinical LLMs

Authors: Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, Marco AF Pimentel

Abstract: Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address the limitations of generic models in healthcare settings. These models are built on Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to effectively respond to natural prompts. While generic models are often preference-aligned to avoid answering… ▽ More Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address the limitations of generic models in healthcare settings. These models are built on Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to effectively respond to natural prompts. While generic models are often preference-aligned to avoid answering clinical queries as a precaution, Med42-v2 is specifically trained to overcome this limitation, enabling its use in clinical settings. Med42-v2 models demonstrate superior performance compared to the original Llama3 models in both 8B and 70B parameter configurations and GPT-4 across various medical benchmarks. These LLMs are developed to understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments. The models are now publicly available at \href{https://huggingface.co/m42-health}{https://huggingface.co/m42-health}. △ Less

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2407.21072 [pdf, other]

Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks

Authors: Marco AF Pimentel, Clément Christophe, Tathagata Raha, Prateek Munjal, Praveen K Kanithi, Shadab Khan

Abstract: As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of various linguistic tasks, model architectures, and benchmarking methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the fi… ▽ More As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of various linguistic tasks, model architectures, and benchmarking methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks for assessing the capabilities of LLMs across diverse domains. This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: 15 pages, 3 figures

arXiv:2404.14779 [pdf, other]

Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches

Authors: Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, Bhargav Kanakiya, Charles Chen, Natalia Vassilieva, Boulbaba Ben Amor, Marco AF Pimentel, Shadab Khan

Abstract: This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering… ▽ More This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks. Notably, our medical LLM Med42 showed an accuracy level of 72% on the US Medical Licensing Examination (USMLE) datasets, setting a new standard in performance for openly available medical LLMs. Through this comparative analysis, we aim to identify the most effective and efficient method for fine-tuning LLMs in the medical domain, thereby contributing significantly to the advancement of AI-driven healthcare applications. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: Published at AAAI 2024 Spring Symposium - Clinical Foundation Models

arXiv:2107.12536 [pdf, other]

doi 10.1109/ACCESS.2021.3108682

A Data-Driven Biophysical Computational Model of Parkinson's Disease based on Marmoset Monkeys

Authors: Caetano M. Ranieri, Jhielson M. Pimentel, Marcelo R. Romano, Leonardo A. Elias, Roseli A. F. Romero, Michael A. Lones, Mariana F. P. Araujo, Patricia A. Vargas, Renan C. Moioli

Abstract: In this work we propose a new biophysical computational model of brain regions relevant to Parkinson's Disease based on local field potential data collected from the brain of marmoset monkeys. Parkinson's disease is a neurodegenerative disorder, linked to the death of dopaminergic neurons at the substantia nigra pars compacta, which affects the normal dynamics of the basal ganglia-thalamus-cortex… ▽ More In this work we propose a new biophysical computational model of brain regions relevant to Parkinson's Disease based on local field potential data collected from the brain of marmoset monkeys. Parkinson's disease is a neurodegenerative disorder, linked to the death of dopaminergic neurons at the substantia nigra pars compacta, which affects the normal dynamics of the basal ganglia-thalamus-cortex neuronal circuit of the brain. Although there are multiple mechanisms underlying the disease, a complete description of those mechanisms and molecular pathogenesis are still missing, and there is still no cure. To address this gap, computational models that resemble neurobiological aspects found in animal models have been proposed. In our model, we performed a data-driven approach in which a set of biologically constrained parameters is optimised using differential evolution. Evolved models successfully resembled single-neuron mean firing rates and spectral signatures of local field potentials from healthy and parkinsonian marmoset brain data. As far as we are concerned, this is the first computational model of Parkinson's Disease based on simultaneous electrophysiological recordings from seven brain regions of Marmoset monkeys. Results show that the proposed model could facilitate the investigation of the mechanisms of PD and support the development of techniques that can indicate new therapies. It could also be applied to other computational neuroscience problems in which biological data could be used to fit multi-scale models of brain circuits. △ Less

Submitted 1 September, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

Journal ref: IEEE Access, 2021

arXiv:2005.13228 [pdf, ps, other]

Oligopoly Dynamics

Authors: Bernardo Melo Pimentel

Abstract: The present notes summarise the oligopoly dynamics lectures professor Luís Cabral gave at the Bank of Portugal in September and October 2017. The lectures discuss a set industrial organisation problems in a dynamic environment, namely learning by doing, switching costs, price wars, networks and platforms, and ladder models of innovation. Methodologically, the materials cover analytical solutions o… ▽ More The present notes summarise the oligopoly dynamics lectures professor Luís Cabral gave at the Bank of Portugal in September and October 2017. The lectures discuss a set industrial organisation problems in a dynamic environment, namely learning by doing, switching costs, price wars, networks and platforms, and ladder models of innovation. Methodologically, the materials cover analytical solutions of known points (e.g., $δ= 0$), the discussion of firms' strategies based on intuitions derived directly from their value functions with no model solving, and the combination of analytical and numerical procedures to reach model solutions. State space analysis is done for both continuous and discrete cases. All errors are my own. △ Less

Submitted 27 May, 2020; originally announced May 2020.

arXiv:2005.08537 [pdf]

doi 10.1088/1361-6579/abba0a

Remote health monitoring and diagnosis in the time of COVID-19

Authors: Joachim A. Behar, Chengyu Liu, Kevin Kotzen, Kenta Tsutsui, Valentina D. A. Corino, Janmajay Singh, Marco A. F. Pimentel, Philip Warrick, Sebastian Zaunseder, Fernando Andreotti, David Sebag, Georgy Popanitsa, Patrick E. McSharry, Walter Karlen, Chandan Karmakar, Gari D. Clifford

Abstract: Coronavirus disease (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that is rapidly spreading across the globe. The clinical spectrum of SARS-CoV-2 pneumonia ranges from mild to critically ill cases and requires early detection and monitoring, within a clinical environment for critical cases and remotely for mild cases. The fear of contamination in clinical… ▽ More Coronavirus disease (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that is rapidly spreading across the globe. The clinical spectrum of SARS-CoV-2 pneumonia ranges from mild to critically ill cases and requires early detection and monitoring, within a clinical environment for critical cases and remotely for mild cases. The fear of contamination in clinical environments has led to a dramatic reduction in on-site referrals for routine care. There has also been a perceived need to continuously monitor non-severe COVID- 19 patients, either from their quarantine site at home, or dedicated quarantine locations (e.g., hotels). Thus, the pandemic has driven incentives to innovate and enhance or create new routes for providing healthcare services at distance. In particular, this has created a dramatic impetus to find innovative ways to remotely and effectively monitor patient health status. In this paper we present a short review of remote health monitoring initiatives taken in 19 states during the time of the pandemic. We emphasize in the discussion particular aspects that are common ground for the reviewed states, in particular the future impact of the pandemic on remote health monitoring and consideration on data privacy. △ Less

Submitted 15 October, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 36 pages

arXiv:1802.00030 [pdf, other]

Fusarium Damaged Kernels Detection Using Transfer Learning on Deep Neural Network Architecture

Authors: Márcio Nicolau, Márcia Barrocas Moreira Pimentel, Casiane Salete Tibola, José Mauricio Cunha Fernandes, Willingthon Pavan

Abstract: The present work shows the application of transfer learning for a pre-trained deep neural network (DNN), using a small image dataset ($\approx$ 12,000) on a single workstation with enabled NVIDIA GPU card that takes up to 1 hour to complete the training task and archive an overall average accuracy of $94.7\%$. The DNN presents a $20\%$ score of misclassification for an external test dataset. The a… ▽ More The present work shows the application of transfer learning for a pre-trained deep neural network (DNN), using a small image dataset ($\approx$ 12,000) on a single workstation with enabled NVIDIA GPU card that takes up to 1 hour to complete the training task and archive an overall average accuracy of $94.7\%$. The DNN presents a $20\%$ score of misclassification for an external test dataset. The accuracy of the proposed methodology is equivalent to ones using HSI methodology $(81\%-91\%)$ used for the same task, but with the advantage of being independent on special equipment to classify wheat kernel for FHB symptoms. △ Less

Submitted 31 January, 2018; originally announced February 2018.

Showing 1–12 of 12 results for author: Pimentel, M