-
Biomedical Open Source Software: Crucial Packages and Hidden Heroes
Authors:
Andrew Nesbitt,
Boris Veytsman,
Daniel Mietchen,
Eva Maxfield Brown,
James Howison,
João Felipe Pimentel,
Laurent Hébert-Dufresne,
Stephan Druskat
Abstract:
Despite its importance for research, scientific software is often not formally recognized and rewarded. This is especially true for foundation libraries, which are used by the software packages visible to users while remaining ``hidden'' themselves. Funders and other organizations need to understand the complex network of computer programs that modern research relies upon.
In this work we used the CZ Software Mentions Dataset to map the dependencies of the software used in biomedical papers and find the packages critical to the software ecosystems. We propose centrality metrics for the network of software dependencies, analyze three ecosystems (PyPI, CRAN, Bioconductor) and determine the packages with the highest centrality.
Submitted 19 May, 2025; v1 submitted 9 April, 2024;
originally announced April 2024.
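The idea of ranking packages by centrality in a dependency network, as described in the abstract above, can be illustrated with a small sketch. The abstract does not say which centrality metrics the authors propose, so PageRank is used here purely as a stand-in, and the package names and the `pagerank` helper are hypothetical.

```python
def pagerank(deps, damping=0.85, iterations=100):
    """Toy PageRank over a dependency graph.

    deps maps a package name to the list of packages it depends on,
    so rank flows from visible packages towards their foundations.
    """
    nodes = set(deps)
    for targets in deps.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = deps.get(p, [])
            if targets:
                # Split this package's rank among its dependencies.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A package with no dependencies spreads its rank uniformly.
                share = damping * rank[p] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank


# Hypothetical ecosystem: two user-facing tools built on shared libraries.
toy_deps = {
    "plotting_app": ["array_lib", "stats_lib"],
    "genomics_tool": ["stats_lib", "array_lib"],
    "stats_lib": ["array_lib"],
}
ranks = pagerank(toy_deps)
most_central = max(ranks, key=ranks.get)
```

In this toy graph the library that nothing cites directly but everything depends on ends up with the highest score, which is the "hidden" situation the abstract describes.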
-
Scale dependence of distributions of hotspots
Authors:
Michael Wilkinson,
Boris Veytsman
Abstract:
We consider a random field $φ(\mathbf{r})$ in $d$ dimensions which is largely concentrated around small `hotspots', with `weights', $w_i$. These weights may have a very broad distribution, such that their mean does not exist, or else is not a useful estimate. In such cases, the median $\overline W$ of the total weight $W$ in a region of size $R$ is an informative characterisation of the weights. We define the function $F$ by $\ln \overline W=F(\ln R)$. If $F'(x)>d$, the distribution of hotspots is dominated by the largest weights. In the case where $F'(x)-d$ approaches a constant positive value when $R\to \infty$, the hotspots distribution has a type of scale-invariance which is different from that of fractal sets, and which we term \emph{ultradimensional}. The form of the function $F(x)$ is determined for a model of diffusion in a random potential.
Submitted 8 November, 2023;
originally announced November 2023.
-
A large dataset of software mentions in the biomedical literature
Authors:
Ana-Maria Istrate,
Donghui Li,
Dario Taraborelli,
Michaela Torkar,
Boris Veytsman,
Ivana Williams
Abstract:
We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and from papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it on the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97600 unique software entities, covering 78% of all software-paper links. We link 185000 of the mentions to a repository, covering about 55% of all software-paper links. We describe in detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.
Submitted 27 September, 2022; v1 submitted 1 September, 2022;
originally announced September 2022.
-
Statistical topology of the streamlines of a two-dimensional flow
Authors:
M. Kamb,
J. Byrum,
G. Huber,
G. Le Treut,
S. Mehta,
B. Veytsman,
D. Yllanes
Abstract:
Recent experiments on mucociliary clearance, an important defense against airborne pathogens, have raised questions about the topology of two-dimensional (2D) flows. We introduce a framework for studying ensembles of 2D time-invariant flow fields and estimating the probability for a particle to leave a finite area (to clear out). We establish two upper bounds on this probability by leveraging different insights about the distribution of flow velocities on the closed and open streamlines. We also deduce an exact power-series expression for the trapped area based on the asymptotic dynamics of flow-field trajectories and complement our analytical results with numerical simulations.
Submitted 11 February, 2023; v1 submitted 20 October, 2021;
originally announced October 2021.
-
Using BibTeX to Automatically Generate Labeled Data for Citation Field Extraction
Authors:
Dung Thai,
Zhiyang Xu,
Nicholas Monath,
Boris Veytsman,
Andrew McCallum
Abstract:
Accurate parsing of citation reference strings is crucial to automatically construct scholarly databases such as Google Scholar or Semantic Scholar. Citation field extraction (CFE) is precisely this task: given a reference, label which tokens refer to the authors, venue, title, editor, journal, pages, etc. Most methods for CFE are supervised and rely on training from labeled datasets that are quite small compared to the great variety of reference formats. BibTeX, the widely used reference management tool, provides a natural method to automatically generate and label training data for CFE. In this paper, we describe a technique for using BibTeX to automatically generate a large-scale labeled dataset (41M labeled strings) that is four orders of magnitude larger than the current largest CFE dataset, namely the UMass Citation Field Extraction dataset [Anzaroot and McCallum, 2013]. We experimentally demonstrate how our dataset can be used to improve performance on the UMass CFE dataset using a RoBERTa-based [Liu et al., 2019] model. In comparison to the previous SoTA, we achieve a 24.48% relative error reduction, reaching a span-level F1 score of 96.3%.
Submitted 9 June, 2020;
originally announced June 2020.
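The core idea in the abstract above, that a structured bibliography record can be rendered as a reference string whose tokens are labeled automatically, can be sketched as follows. This is only an illustration: the actual pipeline renders records with BibTeX itself, whereas here a hypothetical `label_citation` helper and two made-up styles stand in for real bibliography styles, and the record is invented.

```python
def label_citation(entry, style):
    """Render one record under one style and label every token.

    entry maps field names to values; style is an ordered list of
    (field, separator) pairs standing in for a bibliography style.
    """
    tokens, labels = [], []
    for field, separator in style:
        value = entry.get(field)
        if not value:
            continue  # skip fields the record does not have
        for word in value.split():
            tokens.append(word)
            labels.append(field)  # the label comes for free from the record
        if separator:
            tokens.append(separator)
            labels.append("O")  # punctuation belongs to no field
    return tokens, labels


# Invented record and styles, for illustration only.
record = {
    "author": "A. Author and B. Writer",
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "year": "2020",
}
style_a = [("author", "."), ("title", "."), ("journal", ","), ("year", ".")]
style_b = [("author", ","), ("year", "."), ("title", "."), ("journal", "")]

tokens, labels = label_citation(record, style_a)
tokens_b, labels_b = label_citation(record, style_b)
```

Rendering the same record under many styles yields differently ordered and punctuated strings with consistent labels, which is what makes the generated dataset large without manual annotation.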
-
Simple Mathematical Model Of Pathologic Microsatellite Expansions: When Self-Reparation Does Not Work
Authors:
Boris Veytsman,
Leila Akhmadeyeva
Abstract:
We propose a simple model of pathologic microsatellite expansion, and describe an inherent self-repairing mechanism working against expansion. We prove that if the probabilities of elementary expansions and contractions are equal, microsatellite expansions are always self-repairing. If these probabilities are different, self-reparation does not work. Mosaicism, anticipation and reverse mutation cases are discussed in the framework of the model. We explain these phenomena and provide some theoretical evidence for their properties, for example the rarity of reverse mutations.
Submitted 29 July, 2007;
originally announced July 2007.
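The contrast between equal and unequal expansion and contraction probabilities in the abstract above can be explored numerically. The sketch below assumes, as an interpretation not confirmed by the abstract, that the repeat length behaves like a one-dimensional random walk; the function name, parameters, and numerical values are all hypothetical.

```python
import random


def repair_fraction(p_expand, excess=5, max_steps=2000, trials=200, seed=0):
    """Fraction of simulated walks that return to the normal length.

    Each walk starts `excess` repeats above normal. At every step the
    length grows by one with probability p_expand, otherwise it shrinks
    by one. A walk counts as repaired if it reaches zero excess within
    max_steps steps.
    """
    rng = random.Random(seed)
    repaired = 0
    for _ in range(trials):
        length = excess
        for _ in range(max_steps):
            length += 1 if rng.random() < p_expand else -1
            if length == 0:
                repaired += 1
                break
    return repaired / trials


symmetric = repair_fraction(0.5)  # equal expansion and contraction
biased = repair_fraction(0.6)     # expansion more likely than contraction
```

Under this assumed model, most symmetric walks return to the normal length within the step budget, while walks biased towards expansion mostly drift away, in line with the qualitative claim of the abstract.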