Search | arXiv e-print repository

arXiv:2505.10717 [pdf, other]

A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila

Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.

Submitted 21 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

arXiv:2404.15488 [pdf, other]

IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents

Authors: Jean-Philippe Corbeil

Abstract: In natural language processing applied to the clinical domain, utilizing large language models has emerged as a promising avenue for error detection and correction on clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing, analyzing, and taking action, generating trajectories to guide the search to target a potential error in the clinical notes. Subsequently, the MedEval agent employs five evaluators to assess the targeted error and the proposed correction. In cases where MedReAct's actions prove insufficient, the MedReFlex agent intervenes, engaging in reflective analysis and proposing alternative strategies. Finally, the MedFinalParser agent formats the final output, preserving the original style while ensuring the integrity of the error correction process. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Among other well-known sources containing clinical guidelines and information, we preprocess and release the open-source MedWiki dataset for clinical RAG application. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework. It achieved the ninth rank on the MEDIQA-CORR 2024 final leaderboard.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2306.04777 [pdf, other]

Invariant Causal Set Covering Machines

Authors: Thibaud Godon, Baptiste Bauvin, Pascal Germain, Jacques Corbeil, Alexandre Drouin

Abstract: Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations and thus are not guaranteed to extract causally-relevant insights. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering Machines, an extension of the classical Set Covering Machine algorithm for conjunctions/disjunctions of binary-valued rules that provably avoids spurious associations. We demonstrate both theoretically and empirically that our method can identify the causal parents of a variable of interest in polynomial time.

Submitted 21 March, 2025; v1 submitted 7 June, 2023; originally announced June 2023.

arXiv:2306.03208 [pdf, other]

NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks

Authors: Jean-Michel Attendu, Jean-Philippe Corbeil

Abstract: Finetuning large language models inflates the costs of NLU applications and remains the bottleneck of development cycles. Recent works in computer vision use data pruning to reduce training time. Pruned data selection with static methods is based on a score calculated for each training example prior to finetuning, which involves important computational overhead. Moreover, the score may not necessarily be representative of sample importance throughout the entire training duration. We propose to address these issues with a refined version of dynamic data pruning, a curriculum which periodically scores and discards unimportant examples during finetuning. Our method leverages an EL2N metric that we extend to the joint intent and slot classification task, and an initial finetuning phase on the full train set. Our results on the GLUE benchmark and four joint NLU datasets show a better time-accuracy trade-off compared to static methods. Our method preserves full accuracy while training on 50% of the data points and reduces computational times by up to 41%. If we tolerate instead a minor drop of accuracy of 1%, we can prune 80% of the training examples for a reduction in finetuning time reaching 66%.

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2208.12886 [pdf, other]

Building the Intent Landscape of Real-World Conversational Corpora with Extractive Question-Answering Transformers

Authors: Jean-Philippe Corbeil, Mia Taige Li, Hadi Abdi Ghavidel

Abstract: For companies with customer service, mapping intents inside their conversational data is crucial in building applications based on natural language understanding (NLU). Nevertheless, there is no established automated technique to gather the intents from noisy online chats or voice transcripts. Simple clustering approaches are not suited to intent-sparse dialogues. To solve this intent-landscape task, we propose an unsupervised pipeline that extracts the intents and the taxonomy of intents from real-world dialogues. Our pipeline mines intent-span candidates with an extractive Question-Answering ELECTRA model and leverages sentence embeddings to apply a low-level density clustering followed by a top-level hierarchical clustering. Our results demonstrate the generalization ability of an ELECTRA large model fine-tuned on the SQuAD2 dataset to understand dialogues. With the right prompting question, this model achieves a rate of linguistic validation on intent spans beyond 85%. We furthermore reconstructed the intent schemes of five domains from the MultiDoGo dataset with an average recall of 94.3%.

Submitted 30 August, 2022; v1 submitted 26 August, 2022; originally announced August 2022.

arXiv:2208.06436 [pdf, other]

RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data

Authors: Thibaud Godon, Pier-Luc Plante, Baptiste Bauvin, Elina Francovic-Fontaine, Alexandre Drouin, Jacques Corbeil

Abstract: Background: Understanding the relationship between the Omics and the phenotype is a central problem in precision medicine. The high dimensionality of metabolomics data challenges learning algorithms in terms of scalability and generalization. Most learning algorithms do not produce interpretable models. -- Method: We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules. -- Results: Applications on metabolomics data show that it produces models that achieve high predictive performance. The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.

Submitted 11 August, 2022; originally announced August 2022.

Comments: 3 pages, 2 figures

arXiv:2009.12452 [pdf, other]

BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context

Authors: Jean-Philippe Corbeil, Hadi Abdi Ghavidel

Abstract: Newly-introduced deep learning architectures, namely BERT, XLNet, RoBERTa and ALBERT, have been proved to be robust on several NLP tasks. However, the datasets trained on these architectures are fixed in terms of size and generalizability. To relieve this issue, we apply one of the most inexpensive solutions to update these datasets. We call this approach BET, by which we analyze the backtranslation data augmentation on the transformer-based architectures. Using the Google Translate API with ten intermediary languages from ten different language families, we externally evaluate the results in the context of automatic paraphrase identification in a transformer-based framework. Our findings suggest that BET improves the paraphrase identification performance on the Microsoft Research Paraphrase Corpus (MRPC) by more than 3% on both accuracy and F1 score. We also analyze the augmentation in the low-data regime with downsampled versions of MRPC, Twitter Paraphrase Corpus (TPC) and Quora Question Pairs. In many low-data cases, we observe a switch from a failing model on the test set to reasonable performances. The results demonstrate that BET is a highly promising data augmentation technique: to push the current state-of-the-art of existing datasets and to bootstrap the utilization of deep learning architectures in the low-data regime of a hundred samples.

Submitted 25 September, 2020; originally announced September 2020.

MSC Class: 68T50

arXiv:1612.01030 [pdf, other]

Large scale modeling of antimicrobial resistance with interpretable classifiers

Authors: Alexandre Drouin, Frédéric Raymond, Gaël Letarte St-Pierre, Mario Marchand, Jacques Corbeil, François Laviolette

Abstract: Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. In this work, we present the recent work of Drouin et al. (2016) on using Set Covering Machines to learn highly interpretable models of antibiotic resistance and complement it by providing a large scale application of their method to the entire PATRIC database. We report prediction results for 36 new datasets and present the Kover AMR platform, a new web-based tool allowing the visualization and interpretation of the generated models.

Submitted 3 December, 2016; originally announced December 2016.

Comments: Peer-reviewed and accepted for presentation at the Machine Learning for Health Workshop, NIPS 2016, Barcelona, Spain

arXiv:1505.06249 [pdf, other]

Greedy Biomarker Discovery in the Genome with Applications to Antimicrobial Resistance

Authors: Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil

Abstract: The Set Covering Machine (SCM) is a greedy learning algorithm that produces sparse classifiers. We extend the SCM for datasets that contain a huge number of features. The whole genetic material of living organisms is an example of such a case, where the number of features exceeds 10^7. Three human pathogens were used to evaluate the performance of the SCM at predicting antimicrobial resistance. Our results show that the SCM compares favorably in terms of sparsity and accuracy against L1 and L2 regularized Support Vector Machines and CART decision trees. Moreover, the SCM was the only algorithm that could consider the full feature space. For all other algorithms, the latter had to be filtered as a preprocessing step.

Submitted 22 May, 2015; originally announced May 2015.

Comments: Peer-reviewed and accepted for an oral presentation in the Greed is Great workshop at the International Conference on Machine Learning, Lille, France, 2015

arXiv:1412.1074 [pdf, other]

Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine

Authors: Alexandre Drouin, Sébastien Giguère, Vladana Sagatovich, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil

Abstract: The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas aeruginosa, an important human pathogen, against 4 antibiotics. Our results demonstrate that extremely sparse models which are biologically relevant can be learnt using this approach.

Submitted 2 December, 2014; originally announced December 2014.

Comments: Presented at Machine Learning in Computational Biology 2014, Montréal, Québec, Canada

arXiv:1311.3573 [pdf, ps, other]

Improved design and screening of high bioactivity peptides for drug discovery

Authors: Sébastien Giguère, François Laviolette, Mario Marchand, Denise Tremblay, Sylvain Moineau, Éric Biron, Jacques Corbeil

Abstract: The discovery of peptides having high biological activity is very challenging mainly because there is an enormous diversity of compounds and only a minority have the desired properties. To lower cost and reduce the time to obtain promising compounds, machine learning approaches can greatly assist in the process and even replace expensive laboratory experiments by learning a predictor with existing data. Unfortunately, selecting ligands having the greatest predicted bioactivity requires a prohibitive amount of computational time. For this combinatorial problem, heuristics and stochastic optimization methods are not guaranteed to find adequate compounds. We propose an efficient algorithm based on De Bruijn graphs, guaranteed to find the peptides of maximal predicted bioactivity. We demonstrate how this algorithm can be part of an iterative combinatorial chemistry procedure to speed up the discovery and the validation of peptide leads. Moreover, the proposed approach does not require the use of known ligands for the target protein since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data. Finally, we validated the proposed approach in vitro with the discovery of new cationic anti-microbial peptides. Source code is freely available at http://graal.ift.ulaval.ca/peptide-design/.

Submitted 10 April, 2014; v1 submitted 14 November, 2013; originally announced November 2013.

MSC Class: 92B05 ACM Class: I.2.6; J.3; G.3; G.4; I.5.2

arXiv:1301.5406 [pdf]

doi 10.1186/2047-217X-2-10

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Authors: Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T. Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A. Fonseca, Ganeshkumar Ganapathy, Richard A. Gibbs, Sante Gnerre, Élénie Godzaridis, Steve Goldstein, et al. (66 additional authors not shown)

Abstract: Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Submitted 27 June, 2013; v1 submitted 23 January, 2013; originally announced January 2013.

Comments: Additional files available at http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon2/Additional_files/ Major changes: 1. Accessions for the 3 read data sets have now been included. 2. New file: spreadsheet containing details of all Study, Sample, Run, & Experiment identifiers. 3. Made miscellaneous changes to address reviewers' comments. DOIs added to GigaDB datasets.

Journal ref: GigaScience 2:10 (2013)

arXiv:1207.7253 [pdf, other]

doi 10.1186/1471-2105-14-82

Learning a peptide-protein binding affinity predictor with kernel ridge regression

Authors: Sébastien Giguère, Mario Marchand, François Laviolette, Alexandre Drouin, Jacques Corbeil

Abstract: We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, such as the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm for its approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of accurately predicting the binding affinity of any peptide to any protein. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets. On all benchmarks, our method significantly (p-value < 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. The method should be of value to a large segment of the research community with the potential to accelerate peptide-based drug and vaccine development.

Submitted 31 July, 2012; originally announced July 2012.

Comments: 22 pages, 4 figures, 5 tables

MSC Class: 92B05 ACM Class: I.2.6; J.3; G.3; G.4; I.5.2

Journal ref: BMC Bioinformatics 2013, 14:82

arXiv:1005.0530 [pdf, ps, other]

Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data

Authors: Mohak Shah, Mario Marchand, Jacques Corbeil

Abstract: One of the objectives of designing feature selection learning algorithms is to obtain classifiers that depend on a small number of attributes and have verifiable future performance guarantees. There are few, if any, approaches that successfully address the two goals simultaneously. Performance guarantees become crucial for tasks such as microarray data analysis due to very small sample sizes resulting in limited empirical evaluation. To the best of our knowledge, such algorithms that give theoretical bounds on the future performance have not been proposed so far in the context of the classification of gene expression data. In this work, we investigate the premise of learning a conjunction (or disjunction) of decision stumps in Occam's Razor, Sample Compression, and PAC-Bayes learning settings for identifying a small subset of attributes that can be used to perform reliable classification tasks. We apply the proposed approaches for gene identification from DNA microarray data and compare our results to those of well known successful approaches proposed for the task. We show that our algorithm not only finds hypotheses with a much smaller number of genes while giving competitive classification accuracy but also has tight risk guarantees on future performance, unlike other approaches. The proposed approaches are general and extensible in terms of both designing novel algorithms and application to other domains.

Submitted 4 May, 2010; originally announced May 2010.

arXiv:0908.3516 [pdf, other]

doi 10.1364/OL.35.000499

A microstructured fiber source of photon pairs at widely separated wavelengths

Authors: Joshua A. Slater, Jean-Simon Corbeil, Stephane Virally, Felix Bussieres, Alexandre Kudlinski, Geraud Bouwmans, Suzanne Lacroix, Nicolas Godbout, Wolfgang Tittel

Abstract: We demonstrate a source of photon pairs with widely separated wavelengths, 810 nm and 1548 nm, generated through spontaneous four-wave mixing in a microstructured fiber. The second-order auto-correlation function g^{(2)}(0) was measured to confirm the non-classical nature of a heralded single photon source constructed from the fiber. The microstructured fiber presented herein has the interesting property of generating photon pairs with wavelengths suitable for a quantum repeater able to link free-space channels with fiber channels, as well as for a high quality telecommunication wavelength heralded single photon source. It also has the advantage of straightforward coupling into optical fiber. These reasons make this photon pair source particularly interesting for long distance quantum communication.

Submitted 5 March, 2010; v1 submitted 24 August, 2009; originally announced August 2009.

Comments: 3 pages, 3 figures. Published version

Journal ref: Opt. Lett. 35, 499-501 (2010)

Showing 1–15 of 15 results for author: Corbeil, J