-
Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
Authors:
Piotr Gaiński,
Oussama Boussif,
Andrei Rekesh,
Dmytro Shevchuk,
Ali Parviz,
Mike Tyers,
Robert A. Batey,
Michał Koziarski
Abstract:
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets…
▽ More
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose \textbf{Recursive Cost Guidance}, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an \textbf{Exploitation Penalty} that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a \textbf{Dynamic Library} mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
RGFN: Synthesizable Molecular Generation Using GFlowNets
Authors:
Michał Koziarski,
Andrei Rekesh,
Dmytro Shevchuk,
Almer van der Sloot,
Piotr Gaiński,
Yoshua Bengio,
Cheng-Hao Liu,
Mike Tyers,
Robert A. Batey
Abstract:
Generative models hold great promise for small molecule discovery, significantly increasing the size of search space compared to traditional in silico screening libraries. However, most existing machine learning methods for small molecule generation suffer from poor synthesizability of candidate compounds, making experimental validation difficult. In this paper we propose Reaction-GFlowNet (RGFN),…
▽ More
Generative models hold great promise for small molecule discovery, significantly increasing the size of search space compared to traditional in silico screening libraries. However, most existing machine learning methods for small molecule generation suffer from poor synthesizability of candidate compounds, making experimental validation difficult. In this paper we propose Reaction-GFlowNet (RGFN), an extension of the GFlowNet framework that operates directly in the space of chemical reactions, thereby allowing out-of-the-box synthesizability while maintaining comparable quality of generated candidates. We demonstrate that with the proposed set of reactions and building blocks, it is possible to obtain a search space of molecules orders of magnitude larger than existing screening libraries coupled with low cost of synthesis. We also show that the approach scales to very large fragment libraries, further increasing the number of potential molecules. We demonstrate the effectiveness of the proposed approach across a range of oracle models, including pretrained proxy models and GPU-accelerated docking.
△ Less
Submitted 6 November, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Generative Active Learning for the Search of Small-molecule Protein Binders
Authors:
Maksym Korablyov,
Cheng-Hao Liu,
Moksh Jain,
Almer M. van der Sloot,
Eric Jolicoeur,
Edward Ruediger,
Andrei Cristian Nica,
Emmanuel Bengio,
Kostiantyn Lapchevskyi,
Daniel St-Cyr,
Doris Alexandra Schuetz,
Victor Ion Butoi,
Jarrid Rector-Brooks,
Simon Blackburn,
Leo Feng,
Hadi Nekoei,
SaiKrishna Gottipati,
Priyesh Vijayan,
Prateek Gupta,
Ladislav Rampášek,
Sasikanth Avancha,
Pierre-Luc Bacon,
William L. Hamilton,
Brooks Paige,
Sanchit Misra
, et al. (9 additional authors not shown)
Abstract:
Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules which exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecu…
▽ More
Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules which exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecules to discover candidates with a desired property. We apply LambdaZero with molecular docking to design novel small molecules that inhibit the enzyme soluble Epoxide Hydrolase 2 (sEH), while enforcing constraints on synthesizability and drug-likeliness. LambdaZero provides an exponential speedup in terms of the number of calls to the expensive molecular docking oracle, and LambdaZero de novo designed molecules reach docking scores that would otherwise require the virtual screening of a hundred billion molecules. Importantly, LambdaZero discovers novel scaffolds of synthesizable, drug-like inhibitors for sEH. In in vitro experimental validation, a series of ligands from a generated quinazoline-based scaffold were synthesized, and the lead inhibitor N-(4,6-di(pyrrolidin-1-yl)quinazolin-2-yl)-N-methylbenzamide (UM0152893) displayed sub-micromolar enzyme inhibition of sEH.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
OmniLingo: Listening- and speaking-based language learning
Authors:
Francis M. Tyers,
Nicholas Howell
Abstract:
In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data.
In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
Graph-Based Active Machine Learning Method for Diverse and Novel Antimicrobial Peptides Generation and Selection
Authors:
Bonaventure F. P. Dossou,
Dianbo Liu,
Xu Ji,
Moksh Jain,
Almer M. van der Sloot,
Roger Palou,
Michael Tyers,
Yoshua Bengio
Abstract:
As antibiotic-resistant bacterial strains are rapidly spreading worldwide, infections caused by these strains are emerging as a global crisis causing the death of millions of people every year. Antimicrobial Peptides (AMPs) are one of the candidates to tackle this problem because of their potential diversity, and ability to favorably modulate the host immune response. However, large-scale screenin…
▽ More
As antibiotic-resistant bacterial strains are rapidly spreading worldwide, infections caused by these strains are emerging as a global crisis causing the death of millions of people every year. Antimicrobial Peptides (AMPs) are one of the candidates to tackle this problem because of their potential diversity, and ability to favorably modulate the host immune response. However, large-scale screening of new AMP candidates is expensive, time-consuming, and now affordable in developing countries, which need the treatments the most. In this work, we propose a novel active machine learning-based framework that statistically minimizes the number of wet-lab experiments needed to design new AMPs, while ensuring a high diversity and novelty of generated AMPs sequences, in multi-rounds of wet-lab AMP screening settings. Combining recurrent neural network models and a graph-based filter (GraphCC), our proposed approach delivers novel and diverse candidates and demonstrates better performances according to our defined metrics.
△ Less
Submitted 18 September, 2022;
originally announced September 2022.
-
Yet Another Format of Universal Dependencies for Korean
Authors:
Yige Chen,
Eunkyul Leah Jo,
Yundong Yao,
KyungTae Lim,
Miikka Silfverberg,
Francis M. Tyers,
Jungyeul Park
Abstract:
In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automat…
▽ More
In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then testified by both statistical and neural models, including UDPipe and Stanza, with our carefully constructed morpheme-based word embedding for Korean. morphUD outperforms parsing results for all Korean UD treebanks, and we also present detailed error analyses.
△ Less
Submitted 20 September, 2022;
originally announced September 2022.
-
UniMorph 4.0: Universal Morphology
Authors:
Khuyagbaatar Batsuren,
Omer Goldman,
Salam Khalifa,
Nizar Habash,
Witold Kieraś,
Gábor Bella,
Brian Leonard,
Garrett Nicolai,
Kyle Gorman,
Yustinus Ghanggo Ate,
Maria Ryskina,
Sabrina J. Mielke,
Elena Budianskaya,
Charbel El-Khaissi,
Tiago Pimentel,
Michael Gasser,
William Lane,
Mohit Raj,
Matt Coler,
Jaime Rafael Montoya Samame,
Delio Siticonatzi Camaiteri,
Benoît Sagot,
Esaú Zumaeta Rojas,
Didier López Francis,
Arturo Oncevay
, et al. (71 additional authors not shown)
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…
▽ More
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
△ Less
Submitted 19 June, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds in vitro
Authors:
Paul Bertin,
Jarrid Rector-Brooks,
Deepak Sharma,
Thomas Gaudelet,
Andrew Anighoro,
Torsten Gross,
Francisco Martinez-Pena,
Eileen L. Tang,
Suraj M S,
Cristian Regep,
Jeremy Hayter,
Maksym Korablyov,
Nicholas Valiante,
Almer van der Sloot,
Mike Tyers,
Charles Roberts,
Michael M. Bronstein,
Luke L. Lairson,
Jake P. Taylor-King,
Yoshua Bengio
Abstract:
For large libraries of small molecules, exhaustive combinatorial chemical screens become infeasible to perform when considering a range of disease models, assay conditions, and dose ranges. Deep learning models have achieved state of the art results in silico for the prediction of synergy scores. However, databases of drug combinations are biased towards synergistic agents and these results do not…
▽ More
For large libraries of small molecules, exhaustive combinatorial chemical screens become infeasible to perform when considering a range of disease models, assay conditions, and dose ranges. Deep learning models have achieved state of the art results in silico for the prediction of synergy scores. However, databases of drug combinations are biased towards synergistic agents and these results do not necessarily generalise out of distribution. We employ a sequential model optimization search utilising a deep learning model to quickly discover synergistic drug combinations active against a cancer cell line, requiring substantially less screening than an exhaustive evaluation. Our small scale wet lab experiments only account for evaluation of ~5% of the total search space. After only 3 rounds of ML-guided in vitro experimentation (including a calibration round), we find that the set of drug pairs queried is enriched for highly synergistic combinations; two additional rounds of ML-guided experiments were performed to ensure reproducibility of trends. Remarkably, we rediscover drug combinations later confirmed to be under study within clinical trials. Moreover, we find that drug embeddings generated using only structural information begin to reflect mechanisms of action. Prior in silico benchmarking suggests we can enrich search queries by a factor of ~5-10x for highly synergistic drug combinations by using sequential rounds of evaluation when compared to random selection, or by a factor of >3x when using a pretrained model selecting all drug combinations at a single time point.
△ Less
Submitted 2 March, 2023; v1 submitted 6 February, 2022;
originally announced February 2022.
-
What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice
Authors:
Francis M. Tyers,
Josh Meyer
Abstract:
This technical report describes the methods and results of a three-week sprint to produce deployable speech recognition models for 31 under-served languages of the Common Voice project. We outline the preprocessing steps, hyperparameter selection, and resulting accuracy on official testing sets. In addition to this we evaluate the models on multiple tasks: closed-vocabulary speech recognition, pre…
▽ More
This technical report describes the methods and results of a three-week sprint to produce deployable speech recognition models for 31 under-served languages of the Common Voice project. We outline the preprocessing steps, hyperparameter selection, and resulting accuracy on official testing sets. In addition to this we evaluate the models on multiple tasks: closed-vocabulary speech recognition, pre-transcription, forced alignment, and key-word spotting. The following experiments use Coqui STT, a toolkit for training and deployment of neural Speech-to-Text models.
△ Less
Submitted 10 May, 2021;
originally announced May 2021.
-
A bandit approach to curriculum generation for automatic speech recognition
Authors:
Anastasia Kuznetsova,
Anurag Kumar,
Francis M. Tyers
Abstract:
The Automated Speech Recognition (ASR) task has been a challenging domain especially for low data scenarios with few audio examples. This is the main problem in training ASR systems on the data from low-resource or marginalized languages. In this paper we present an approach to mitigate the lack of training data by employing Automated Curriculum Learning in combination with an adversarial bandit a…
▽ More
The Automated Speech Recognition (ASR) task has been a challenging domain especially for low data scenarios with few audio examples. This is the main problem in training ASR systems on the data from low-resource or marginalized languages. In this paper we present an approach to mitigate the lack of training data by employing Automated Curriculum Learning in combination with an adversarial bandit approach inspired by Reinforcement learning. The goal of the approach is to optimize the training sequence of mini-batches ranked by the level of difficulty and compare the ASR performance metrics against the random training sequence and discrete curriculum. We test our approach on a truly low-resource language and show that the bandit framework has a good improvement over the baseline transfer-learning model.
△ Less
Submitted 6 February, 2021;
originally announced February 2021.
-
Common Voice: A Massively-Multilingual Speech Corpus
Authors:
Rosana Ardila,
Megan Branson,
Kelly Davis,
Michael Henretty,
Michael Kohler,
Josh Meyer,
Reuben Morais,
Lindsay Saunders,
Francis M. Tyers,
Gregor Weber
Abstract:
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data valida…
▽ More
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 +/- 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
△ Less
Submitted 5 March, 2020; v1 submitted 13 December, 2019;
originally announced December 2019.
-
Can LSTM Learn to Capture Agreement? The Case of Basque
Authors:
Shauli Ravfogel,
Francis M. Tyers,
Yoav Goldberg
Abstract:
Sequential neural networks models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical to human language, and what kind of grammatical phenomena can they acquire?
We focus on the task of agreement prediction in Basque, as a case study…
▽ More
Sequential neural networks models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical to human language, and what kind of grammatical phenomena can they acquire?
We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks -- verb number prediction and suffix recovery -- we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of a previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as challenging benchmark for models that attempt to learn regularities in human language.
△ Less
Submitted 26 November, 2018; v1 submitted 11 September, 2018;
originally announced September 2018.