
Showing 1–15 of 15 results for author: Corbeil, J

  1. arXiv:2505.10717  [pdf, other]

    cs.CL cs.AI

    A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

    Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila

    Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, w…

    Submitted 21 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2404.15488  [pdf, other]

    cs.CL cs.AI cs.MA

    IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents

    Authors: Jean-Philippe Corbeil

    Abstract: In natural language processing applied to the clinical domain, utilizing large language models has emerged as a promising avenue for error detection and correction on clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing,…

    Submitted 23 April, 2024; originally announced April 2024.

  3. arXiv:2306.04777  [pdf, other]

    cs.LG stat.ME stat.ML

    Invariant Causal Set Covering Machines

    Authors: Thibaud Godon, Baptiste Bauvin, Pascal Germain, Jacques Corbeil, Alexandre Drouin

    Abstract: Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations and thus, they are not guaranteed to extract causally-relevant insights. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering…

    Submitted 21 March, 2025; v1 submitted 7 June, 2023; originally announced June 2023.

  4. arXiv:2306.03208  [pdf, other]

    cs.CL cs.AI

    NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks

    Authors: Jean-Michel Attendu, Jean-Philippe Corbeil

    Abstract: Finetuning large language models inflates the costs of NLU applications and remains the bottleneck of development cycles. Recent works in computer vision use data pruning to reduce training time. Pruned data selection with static methods is based on a score calculated for each training example prior to finetuning, which involves important computational overhead. Moreover, the score may not necessa…

    Submitted 5 June, 2023; originally announced June 2023.

  5. arXiv:2208.12886  [pdf, other]

    cs.CL cs.AI cs.LG

    Building the Intent Landscape of Real-World Conversational Corpora with Extractive Question-Answering Transformers

    Authors: Jean-Philippe Corbeil, Mia Taige Li, Hadi Abdi Ghavidel

    Abstract: For companies with customer service, mapping intents inside their conversational data is crucial in building applications based on natural language understanding (NLU). Nevertheless, there is no established automated technique to gather the intents from noisy online chats or voice transcripts. Simple clustering approaches are not suited to intent-sparse dialogues. To solve this intent-landscape ta…

    Submitted 30 August, 2022; v1 submitted 26 August, 2022; originally announced August 2022.

  6. arXiv:2208.06436  [pdf, other]

    cs.LG

    RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data

    Authors: Thibaud Godon, Pier-Luc Plante, Baptiste Bauvin, Elina Francovic-Fontaine, Alexandre Drouin, Jacques Corbeil

    Abstract: Background: Understanding the relationship between the Omics and the phenotype is a central problem in precision medicine. The high dimensionality of metabolomics data challenges learning algorithms in terms of scalability and generalization. Most learning algorithms do not produce interpretable models -- Method: We propose an ensemble learning algorithm based on conjunctions or disjunctions of de…

    Submitted 11 August, 2022; originally announced August 2022.

    Comments: 3 pages, 2 figures

  7. arXiv:2009.12452  [pdf, other]

    cs.CL

    BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context

    Authors: Jean-Philippe Corbeil, Hadi Abdi Ghavidel

    Abstract: Newly-introduced deep learning architectures, namely BERT, XLNet, RoBERTa and ALBERT, have been proved to be robust on several NLP tasks. However, the datasets trained on these architectures are fixed in terms of size and generalizability. To relieve this issue, we apply one of the most inexpensive solutions to update these datasets. We call this approach BET by which we analyze the backtranslatio…

    Submitted 25 September, 2020; originally announced September 2020.

    MSC Class: 68T50

  8. arXiv:1612.01030  [pdf, other]

    q-bio.GN cs.LG stat.ML

    Large scale modeling of antimicrobial resistance with interpretable classifiers

    Authors: Alexandre Drouin, Frédéric Raymond, Gaël Letarte St-Pierre, Mario Marchand, Jacques Corbeil, François Laviolette

    Abstract: Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of tre…

    Submitted 3 December, 2016; originally announced December 2016.

    Comments: Peer-reviewed and accepted for presentation at the Machine Learning for Health Workshop, NIPS 2016, Barcelona, Spain

  9. arXiv:1505.06249  [pdf, other]

    q-bio.GN cs.LG stat.ML

    Greedy Biomarker Discovery in the Genome with Applications to Antimicrobial Resistance

    Authors: Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil

    Abstract: The Set Covering Machine (SCM) is a greedy learning algorithm that produces sparse classifiers. We extend the SCM for datasets that contain a huge number of features. The whole genetic material of living organisms is an example of such a case, where the number of features exceeds 10^7. Three human pathogens were used to evaluate the performance of the SCM at predicting antimicrobial resistance. Our…

    Submitted 22 May, 2015; originally announced May 2015.

    Comments: Peer-reviewed and accepted for an oral presentation in the Greed is Great workshop at the International Conference on Machine Learning, Lille, France, 2015

  10. arXiv:1412.1074  [pdf, other]

    q-bio.GN cs.CE cs.LG stat.ML

    Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine

    Authors: Alexandre Drouin, Sébastien Giguère, Vladana Sagatovich, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil

    Abstract: The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas Aeruginosa…

    Submitted 2 December, 2014; originally announced December 2014.

    Comments: Presented at Machine Learning in Computational Biology 2014, Montréal, Québec, Canada

  11. arXiv:1311.3573  [pdf, ps, other]

    q-bio.QM

    Improved design and screening of high bioactivity peptides for drug discovery

    Authors: Sébastien Giguère, François Laviolette, Mario Marchand, Denise Tremblay, Sylvain Moineau, Éric Biron, Jacques Corbeil

    Abstract: The discovery of peptides having high biological activity is very challenging mainly because there is an enormous diversity of compounds and only a minority have the desired properties. To lower cost and reduce the time to obtain promising compounds, machine learning approaches can greatly assist in the process and even replace expensive laboratory experiments by learning a predictor with existing…

    Submitted 10 April, 2014; v1 submitted 14 November, 2013; originally announced November 2013.

    MSC Class: 92B05 ACM Class: I.2.6; J.3; G.3; G.4; I.5.2

  12. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

    Authors: Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T. Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A. Fonseca, Ganeshkumar Ganapathy, Richard A. Gibbs, Sante Gnerre, Élénie Godzaridis, Steve Goldstein, et al. (66 additional authors not shown)

    Abstract: Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and…

    Submitted 27 June, 2013; v1 submitted 23 January, 2013; originally announced January 2013.

    Comments: Additional files available at http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon2/Additional_files/ Major changes 1. Accessions for the 3 read data sets have now been included 2. New file: spreadsheet containing details of all Study, Sample, Run, & Experiment identifiers 3. Made miscellaneous changes to address reviewers comments. DOIs added to GigaDB datasets

    Journal ref: GigaScience 2:10 (2013)

  13. arXiv:1207.7253  [pdf, other]

    q-bio.QM cs.LG q-bio.BM stat.ML

    Learning a peptide-protein binding affinity predictor with kernel ridge regression

    Authors: Sébastien Giguère, Mario Marchand, François Laviolette, Alexandre Drouin, Jacques Corbeil

    Abstract: We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, such as the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation…

    Submitted 31 July, 2012; originally announced July 2012.

    Comments: 22 pages, 4 figures, 5 tables

    MSC Class: 92B05 ACM Class: I.2.6; J.3; G.3; G.4; I.5.2

    Journal ref: BMC Bioinformatics 2013, 14:82

  14. arXiv:1005.0530  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data

    Authors: Mohak Shah, Mario Marchand, Jacques Corbeil

    Abstract: One of the objectives of designing feature selection learning algorithms is to obtain classifiers that depend on a small number of attributes and have verifiable future performance guarantees. There are few, if any, approaches that successfully address the two goals simultaneously. Performance guarantees become crucial for tasks such as microarray data analysis due to very small sample sizes resul…

    Submitted 4 May, 2010; originally announced May 2010.

  15. A microstructured fiber source of photon pairs at widely separated wavelengths

    Authors: Joshua A. Slater, Jean-Simon Corbeil, Stephane Virally, Felix Bussieres, Alexandre Kudlinski, Geraud Bouwmans, Suzanne Lacroix, Nicolas Godbout, Wolfgang Tittel

    Abstract: We demonstrate a source of photon pairs with widely separated wavelengths, 810 nm and 1548 nm, generated through spontaneous four-wave mixing in a microstructured fiber. The second-order auto-correlation function g^{(2)}(0) was measured to confirm the non-classical nature of a heralded single photon source constructed from the fiber. The microstructured fiber presented herein has the interesting…

    Submitted 5 March, 2010; v1 submitted 24 August, 2009; originally announced August 2009.

    Comments: 3 pages, 3 figures. Published version

    Journal ref: Opt. Lett. 35, 499-501 (2010)