-
Natural Language Processing tools for Pharmaceutical Manufacturing Information Extraction from Patents
Authors:
Diego Alvarado-Maldonado,
Blair Johnston,
Cameron J. Brown
Abstract:
Abundant and diverse data on medicines manufacturing and other lifecycle components have become easily accessible in recent decades. However, a significant proportion of this information is not tabulated and is therefore not usable for machine learning purposes. Natural language processing tools have been used to build databases in domains such as biomedicine and chemistry to address this limitation. This has allowed the development of artificial intelligence applications, which have improved drug discovery and treatments. In the pharmaceutical manufacturing context, some initiatives and datasets for primary processing can be found, but, to the best of our knowledge, the manufacturing of drug products is an area that is still lacking. This work aims to explore and adapt NLP tools used in other domains to extract information on both primary and secondary manufacturing, employing patents as the main source of data. Two independent but complementary models were developed: a method to select fragments of text that contain manufacturing data, and a named entity recognition (NER) system that extracts information on the operations, materials, and conditions of a process. For the first model, relevant sections were identified using an unsupervised approach combining Latent Dirichlet Allocation and k-Means clustering. The performance of this model, measured as Cohen's kappa between model output and manual revision, was higher than 90%. The NER model consisted of a deep neural network and obtained a micro-averaged F1-score of 84.2%, which is comparable to other works. Some considerations for using these tools in data extraction are discussed throughout this document.
Submitted 1 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
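The agreement measure quoted in the abstract above, Cohen's kappa, compares the observed agreement between two raters with the agreement expected by chance. The following is a minimal sketch of that calculation only; the label lists are hypothetical and are not taken from the paper, and the function name is an illustrative choice.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical labels: 1 = section contains manufacturing data, 0 = it does not.
model_output = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
manual_review = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
kappa = cohens_kappa(model_output, manual_review)
```

In this toy case the two raters agree on 9 of 10 sections and chance agreement is 0.5, so kappa is 0.8, which shows why kappa is stricter than raw percentage agreement.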
-
Last-layer committee machines for uncertainty estimations of benthic imagery
Authors:
H. Martin Gillis,
Isaac Xu,
Benjamin Misiuk,
Craig J. Brown,
Thomas Trappenberg
Abstract:
Automating the annotation of benthic imagery (i.e., images of the seafloor and its associated organisms, habitats, and geological features) is critical for monitoring rapidly changing ocean ecosystems. Deep learning approaches have succeeded in this purpose; however, consistent annotation remains challenging due to ambiguous seafloor images, potential inter-user annotation disagreements, and out-of-distribution samples. Marine scientists implementing deep learning models often obtain predictions based on one-hot representations trained using a cross-entropy loss objective with softmax normalization, resulting in a single set of model parameters. While efficient, this approach may lead to overconfident predictions for context-challenging datasets, raising reliability concerns that present risks for downstream tasks such as benthic habitat mapping and marine spatial planning. In this study, we investigated classification uncertainty as a tool to improve the labeling of benthic habitat imagery. We developed a framework for two challenging sub-datasets of the recently publicly available BenthicNet dataset using Bayesian neural networks, Monte Carlo dropout inference sampling, and a proposed single last-layer committee machine. This approach resulted in a >95% reduction of network parameters to obtain per-sample uncertainties while obtaining near-identical performance compared to computationally more expensive strategies such as Bayesian neural networks, Monte Carlo dropout, and deep ensembles. The method proposed in this research provides a strategy for obtaining prioritized lists of uncertain samples for human-in-the-loop interventions to identify ambiguous, mislabeled, out-of-distribution, and/or difficult images for enhancing existing annotation tools for benthic mapping and other applications.
Submitted 22 April, 2025;
originally announced April 2025.
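The abstract above describes ranking samples by per-sample uncertainty so that the most doubtful images go to a human first. One common way to score uncertainty from several stochastic or committee predictions is the entropy of the averaged class probabilities. The sketch below assumes that scoring rule and uses hypothetical softmax outputs; it is not the authors' committee machine, whose details are not given in the abstract.

```python
import math

def mean_probabilities(member_probs):
    """Average class probabilities over committee members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]

def predictive_entropy(probs):
    """Shannon entropy (nats) of a probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rank_by_uncertainty(samples):
    """Return sample ids sorted from most to least uncertain."""
    scores = {sid: predictive_entropy(mean_probabilities(m)) for sid, m in samples.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical softmax outputs from three committee members for two images.
samples = {
    "img_clear": [[0.96, 0.02, 0.02], [0.94, 0.03, 0.03], [0.95, 0.03, 0.02]],
    "img_ambiguous": [[0.70, 0.20, 0.10], [0.20, 0.70, 0.10], [0.30, 0.30, 0.40]],
}
ranking = rank_by_uncertainty(samples)
```

Here the image on which the members disagree is placed first in `ranking`, which is the kind of prioritized list the abstract refers to.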
-
Hierarchical Multi-Label Classification with Missing Information for Benthic Habitat Imagery
Authors:
Isaac Xu,
Benjamin Misiuk,
Scott C. Lowe,
Martin Gillis,
Craig J. Brown,
Thomas Trappenberg
Abstract:
In this work, we apply state-of-the-art self-supervised learning techniques to a large dataset of seafloor imagery, BenthicNet, and study their performance for a complex hierarchical multi-label (HML) classification downstream task. In particular, we demonstrate the capacity to conduct HML training in scenarios where there exist multiple levels of missing annotation information, an important scenario for handling heterogeneous real-world data collected by multiple research groups with differing data collection protocols. We find that, when using smaller one-hot image label datasets typical of local or regional scale benthic science projects, models pre-trained with self-supervision on a larger collection of in-domain benthic data outperform models pre-trained on ImageNet. In the HML setting, we find the model can attain a deeper and more precise classification if it is pre-trained with self-supervision on in-domain data. We hope this work can establish a benchmark for future models in the field of automated underwater image annotation tasks and can guide work in other domains with hierarchical annotations of mixed resolution.
Submitted 10 September, 2024;
originally announced September 2024.
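Training with missing annotation levels, as mentioned in the abstract above, is often handled by masking: labels that are unknown simply contribute nothing to the loss. The sketch below illustrates that general idea with a masked binary cross-entropy; it is an assumption for illustration and not the loss used in the paper, and the hierarchy and values are hypothetical.

```python
import math

def masked_bce(probabilities, targets):
    """Mean binary cross-entropy over known labels; None targets are ignored."""
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    losses = []
    for p, t in zip(probabilities, targets):
        if t is None:
            continue  # missing annotation contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical three-level hierarchy for one image, coarse to fine.
# The annotator labelled only the two coarsest levels; the finest is missing.
predicted = [0.9, 0.8, 0.3]
annotated = [1, 1, None]
loss = masked_bce(predicted, annotated)
```

Because the finest level is masked, the prediction of 0.3 there is neither rewarded nor penalized, so images annotated at different depths can share one training set.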
-
BenthicNet: A global compilation of seafloor images for deep learning applications
Authors:
Scott C. Lowe,
Benjamin Misiuk,
Isaac Xu,
Shakhboz Abdulazizov,
Amit R. Baroi,
Alex C. Bastos,
Merlin Best,
Vicki Ferrini,
Ariell Friedman,
Deborah Hart,
Ove Hoegh-Guldberg,
Daniel Ierodiaconou,
Julia Mackin-McLaughlin,
Kathryn Markey,
Pedro S. Menandro,
Jacquomo Monk,
Shreya Nemani,
John O'Brien,
Elizabeth Oh,
Luba Y. Reshitnyk,
Katleen Robert,
Chris M. Roelfsema,
Jessica A. Sameoto,
Alexandre C. G. Schimel,
Jordan A. Thomson
, et al. (4 additional authors not shown)
Abstract:
Advances in underwater imaging enable collection of extensive seafloor image datasets necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering mobilization of this crucial environmental information. Machine learning approaches provide opportunities to increase the efficiency with which seafloor imagery is analyzed, yet large and consistent datasets to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. These are accompanied by 3.1 million annotations translated to the CATAMI scheme, which span 190,000 of the images. A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for reuse at https://doi.org/10.20383/103.0614.
Submitted 18 February, 2025; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Multi-Sample $ζ$-mixup: Richer, More Realistic Synthetic Samples from a $p$-Series Interpolant
Authors:
Kumar Abhishek,
Colin J. Brown,
Ghassan Hamarneh
Abstract:
Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, mixup, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, mixup can produce undesirable synthetic samples, where the data is sampled off the manifold and can contain incorrect labels. We propose $ζ$-mixup, a generalization of mixup with provably and demonstrably desirable properties that allows convex combinations of $N \geq 2$ samples, leading to more realistic and diverse outputs that incorporate information from $N$ original samples by using a $p$-series interpolant. We show that, compared to mixup, $ζ$-mixup better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of $ζ$-mixup is faster than mixup, and extensive evaluation on controlled synthetic and 24 real-world natural and medical image classification datasets shows that $ζ$-mixup outperforms mixup and traditional data augmentation techniques.
Submitted 7 April, 2022;
originally announced April 2022.
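The abstract above describes convex combinations of $N \geq 2$ samples weighted by a $p$-series interpolant. The sketch below shows one way such weights could look, assuming they are proportional to $i^{-\gamma}$ and normalized to sum to 1, with the samples randomly ordered first. The exponent name `gamma`, the function names, and the data are illustrative assumptions; the paper's exact formulation may differ.

```python
import random

def p_series_weights(n, gamma):
    """Normalized weights proportional to i**(-gamma) for i = 1..n."""
    raw = [i ** (-gamma) for i in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def multi_sample_mix(samples, labels, gamma, rng):
    """Convex combination of all samples (and labels) under a random ordering."""
    order = list(range(len(samples)))
    rng.shuffle(order)  # a randomly chosen sample receives the largest weight
    weights = p_series_weights(len(samples), gamma)
    dim, n_classes = len(samples[0]), len(labels[0])
    mixed_x = [sum(w * samples[i][d] for w, i in zip(weights, order)) for d in range(dim)]
    mixed_y = [sum(w * labels[i][c] for w, i in zip(weights, order)) for c in range(n_classes)]
    return mixed_x, mixed_y

# Hypothetical 2-D feature vectors with one-hot labels for three classes.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
mixed_x, mixed_y = multi_sample_mix(samples, labels, gamma=2.0, rng=random.Random(0))
```

Since the weights are non-negative and sum to 1, the mixed sample stays inside the convex hull of the originals and the mixed label remains a valid probability vector. A larger `gamma` concentrates the weight on the first sample in the ordering, keeping the synthetic point closer to a real one.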
-
A planet within the debris disk around the pre-main-sequence star AU Microscopii
Authors:
Peter Plavchan,
Thomas Barclay,
Jonathan Gagné,
Peter Gao,
Bryson Cale,
William Matzko,
Diana Dragomir,
Sam Quinn,
Dax Feliz,
Keivan Stassun,
Ian J. M. Crossfield,
David A. Berardo,
David W. Latham,
Ben Tieu,
Guillem Anglada-Escudé,
George Ricker,
Roland Vanderspek,
Sara Seager,
Joshua N. Winn,
Jon M. Jenkins,
Stephen Rinehart,
Akshata Krishnamurthy,
Scott Dynes,
John Doty,
Fred Adams
, et al. (62 additional authors not shown)
Abstract:
AU Microscopii (AU Mic) is the second closest pre-main-sequence star, at a distance of 9.79 parsecs and with an age of 22 million years. AU Mic possesses a relatively rare and spatially resolved edge-on debris disk extending from about 35 to 210 astronomical units from the star, and with clumps exhibiting non-Keplerian motion. Detection of newly formed planets around such a star is challenged by the presence of spots, plage, flares and other manifestations of magnetic activity on the star. Here we report observations of a planet transiting AU Mic. The transiting planet, AU Mic b, has an orbital period of 8.46 days, an orbital distance of 0.07 astronomical units, a radius of 0.4 Jupiter radii, and a mass of less than 0.18 Jupiter masses at 3 sigma confidence. Our observations of a planet co-existing with a debris disk offer the opportunity to test the predictions of current models of planet formation and evolution.
Submitted 25 June, 2020; v1 submitted 23 June, 2020;
originally announced June 2020.
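The orbital period and orbital distance quoted in the abstract above are linked by Kepler's third law, so together they imply a host-star mass. The sketch below performs that consistency check; it neglects the planet's mass, treats the orbital distance as the semi-major axis, and uses the rounded values from the abstract, so the result is only indicative. The function name is an illustrative choice.

```python
DAYS_PER_YEAR = 365.25

def implied_stellar_mass(period_days, semi_major_axis_au):
    """Host mass in solar masses from Kepler's third law, neglecting planet mass.

    In units of AU, years and solar masses the law reads a**3 = M * P**2.
    """
    period_years = period_days / DAYS_PER_YEAR
    return semi_major_axis_au ** 3 / period_years ** 2

# Values quoted in the abstract (the orbital distance is rounded to 0.07 AU).
mass = implied_stellar_mass(period_days=8.46, semi_major_axis_au=0.07)
```

With these rounded inputs the implied mass is roughly 0.64 solar masses, that is, a star less massive than the Sun. Because the mass scales with the cube of the orbital distance, the rounding of 0.07 AU alone shifts this estimate noticeably.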
-
Monitoring crystal breakage in wet milling processes using inline imaging and chord length distribution measurements
Authors:
Okpeafoh S. Agimelen,
Vaclav Svoboda,
Bilal Ahmed,
Javier Cardona,
Jerzy Dziewierz,
Cameron J. Brown,
Thomas McGlone,
Alison Cleary,
Christos Tachtatzis,
Craig Michie,
Alastair J. Florence,
Ivan Andonovic,
Anthony J. Mulholland,
Jan Sefcik
Abstract:
The success of the various secondary operations involved in the production of particulate products depends on the production of particles with a desired size and shape from a previous primary operation such as crystallisation. This is because these properties of size and shape affect the behaviour of the particles in the secondary processes. The size and the shape of the particles are very sensitive to the conditions of the crystallisation processes, and so control of these processes is essential. This control requires the development of software tools that can effectively and efficiently process the sensor data captured in situ. However, these tools have various strengths and limitations depending on the process conditions and the nature of the particles.
In this work, we employ wet milling of crystalline particles as a case study of a process which produces effects typical of crystallisation processes. We study some of the strengths and limitations of our previously introduced tools for estimating the particle size distribution (PSD) and the aspect ratio from chord length distribution (CLD) and imaging data. We find situations where the CLD tool works better than the imaging tool and vice versa. However, in general both tools complement each other, and can therefore be employed in a suitable multi-objective optimisation approach to estimate PSD and aspect ratio.
Submitted 27 March, 2017;
originally announced March 2017.
-
Machine Learning on Human Connectome Data from MRI
Authors:
Colin J Brown,
Ghassan Hamarneh
Abstract:
Functional MRI (fMRI) and diffusion MRI (dMRI) are non-invasive imaging modalities that allow in-vivo analysis of a patient's brain network (known as a connectome). Use of these technologies has enabled faster and better diagnoses and treatments of neurological disorders and a deeper understanding of the human brain. Recently, researchers have been exploring the application of machine learning models to connectome data in order to predict clinical outcomes and analyze the importance of subnetworks in the brain. Connectome data has unique properties, which present both special challenges and opportunities when used for machine learning. The purpose of this work is to review the literature on the topic of applying machine learning models to MRI-based connectome data. This field is growing rapidly and now encompasses a large body of research. To summarize the research done to date, we provide a comparative, structured summary of 77 relevant works, tabulated according to different criteria, that represent the majority of the literature on this topic. (We also published a living version of this table online at http://connectomelearning.cs.sfu.ca that the community can continue to contribute to.) After giving an overview of how connectomes are constructed from dMRI and fMRI data, we discuss the variety of machine learning tasks that have been explored with connectome data. We then compare the advantages and drawbacks of different machine learning approaches that have been employed, discussing different feature selection and feature extraction schemes, as well as the learning models and regularization penalties themselves. Throughout this discussion, we focus particularly on how the methods are adapted to the unique nature of graphical connectome data. Finally, we conclude by summarizing the current state of the art and by outlining what we believe are strategic directions for future research.
Submitted 26 November, 2016;
originally announced November 2016.
-
Magnetic fields and differential rotation on the pre-main sequence II: The early-G star HD 141943 - coronal magnetic field, H-alpha emission and differential rotation
Authors:
S. C. Marsden,
M. M. Jardine,
J. C. Ramírez Vélez,
E. Alecian,
C. J. Brown,
B. D. Carter,
J. F. Donati,
N. Dunstone,
R. Hart,
M. Semel,
I. A. Waite
Abstract:
Spectropolarimetric observations of the pre-main sequence early-G star HD 141943 were obtained at three observing epochs (2007, 2009 and 2010). The observations were obtained using the 3.9-m Anglo-Australian Telescope with the UCLES echelle spectrograph and the SEMPOL spectropolarimeter visitor instrument. The brightness and surface magnetic field topologies (given in Paper I) were used to determine the star's surface differential rotation and reconstruct the coronal magnetic field of the star.
The coronal magnetic field at the 3 epochs shows that, on the largest scales, the field structure is dominated by the dipole component, with possible evidence for the tilt of the dipole axis shifting between observations. We find very high levels of differential rotation on HD 141943 (~8 times the solar value for the magnetic features and ~5 times solar for the brightness features), similar to that evidenced by another young early-G star, HD 171488. These results indicate that a significant increase in the level of differential rotation occurs for young stars around a spectral type of early-G. We also find for the 2010 observations that there is a large difference in the differential rotation measured from the brightness and magnetic features, similar to that seen on early-K stars, but with the difference being much larger. We find only tentative evidence for temporal evolution in the differential rotation of HD 141943.
Submitted 31 January, 2011;
originally announced January 2011.
-
Magnetic fields and differential rotation on the pre-main sequence I: The early-G star HD 141943 - brightness and magnetic topologies
Authors:
S. C. Marsden,
M. M. Jardine,
J. C. Ramírez Vélez,
E. Alecian,
C. J. Brown,
B. D. Carter,
J. F. Donati,
N. Dunstone,
R. Hart,
M. Semel,
I. A. Waite
Abstract:
Spectroscopic and spectropolarimetric observations of the pre-main sequence early-G star HD 141943 were obtained at four observing epochs (in 2006, 2007, 2009 and 2010). The observations were undertaken at the 3.9-m Anglo-Australian Telescope using the UCLES echelle spectrograph and the SEMPOL spectropolarimeter visitor instrument. Brightness and surface magnetic field topologies were reconstructed for the star using the technique of least-squares deconvolution to increase the signal-to-noise of the data.
The reconstructed brightness maps show that HD 141943 had a weak polar spot and a significant amount of low latitude features, with little change in the latitude distribution of the spots over the 4 years of observations. The surface magnetic field was reconstructed at three of the epochs from a high order (l <= 30) spherical harmonic expansion of the spectropolarimetric observations. The reconstructed magnetic topologies show that in 2007 and 2010 the surface magnetic field was reasonably balanced between poloidal and toroidal components. However, we find tentative evidence of a change in the poloidal/toroidal ratio in 2009, with the poloidal component becoming more dominant. At all epochs the radial magnetic field is predominantly non-axisymmetric, while the azimuthal field is predominantly axisymmetric with a ring of positive azimuthal field around the pole, similar to that seen on other active stars.
Submitted 31 January, 2011;
originally announced January 2011.
-
Computational Estimates of Binding Affinities for Estrogen Receptor Isoforms in Rainbow Trout
Authors:
Conrad Shyu,
Celeste J. Brown,
F. Marty Ytreberg
Abstract:
Molecular dynamics simulations were used to determine the binding affinities between the hormone 17 beta-estradiol (E2) and different estrogen receptor (ER) isoforms in the rainbow trout, Oncorhynchus mykiss. Previous phylogenetic analysis indicates that a whole genome duplication prior to the divergence of ray-finned fish led to two distinct ER beta isoforms, ER beta 1 and ER beta 2, and the recent whole genome duplication in the ancestral salmonid created two ER alpha isoforms, ER alpha 1 and ER alpha 2. The objective of our computational studies is to provide insight into the underlying evolutionary pressures on these isoforms. For the ER alpha subtype our results show that E2 binds preferentially to ER alpha 1 over ER alpha 2. Tests of lineage specific dN/dS ratios indicate that the ligand binding domain of the ER alpha 2 gene is evolving under relaxed selection relative to all other ER alpha genes. Comparison with the highly conserved DNA binding domain suggests that ER alpha 2 may be undergoing neofunctionalization, possibly by binding to another ligand. By contrast, both ER beta 1 and ER beta 2 bind similarly to E2, and the best fitting model of selection indicates that the ligand binding domain of all ER beta genes is evolving under the same level of purifying selection, comparable to ER alpha 1.
Submitted 2 February, 2010; v1 submitted 4 September, 2009;
originally announced September 2009.