-
KG-FGNN: Knowledge-guided GNN Foundation Model for Fertilisation-oriented Soil GHG Flux Prediction
Authors:
Yu Zhang,
Gaoshan Bi,
Simon Jeffery,
Max Davis,
Yang Li,
Qing Xue,
Po Yang
Abstract:
Precision soil greenhouse gas (GHG) flux prediction is essential in agricultural systems for assessing environmental impacts, developing emission mitigation strategies and promoting sustainable agriculture. Due to the lack of advanced sensor and network technologies on the majority of farms, there are challenges in obtaining comprehensive and diverse agricultural data. As a result, the scarcity of agricultural data seriously obstructs the application of machine learning approaches in precision soil GHG flux prediction. This research proposes a knowledge-guided graph neural network framework that addresses the above challenges by integrating knowledge embedded in an agricultural process-based model and graph neural network techniques. Specifically, we utilise the agricultural process-based model to simulate and generate multi-dimensional agricultural datasets for 47 countries that cover a wide range of agricultural variables. To extract key agricultural features and integrate correlations among agricultural features in the prediction process, we propose a machine learning framework that integrates the autoencoder and multi-target multi-graph based graph neural networks. The framework utilises the autoencoder to selectively extract significant agricultural features from the agricultural process-based model simulation data and the graph neural network to integrate correlations among agricultural features, in order to accurately predict fertilisation-oriented soil GHG fluxes. Comprehensive experiments were conducted with both the agricultural simulation dataset and real-world agricultural dataset to evaluate the proposed approach in comparison with well-known baseline and state-of-the-art regression methods. The results demonstrate that our proposed approach provides superior accuracy and stability in fertilisation-oriented soil GHG prediction.
Submitted 18 June, 2025;
originally announced June 2025.
-
A Case Study on Virtual and Physical I/O Throughputs
Authors:
T. Mirzoev,
B. Yang,
M. Davis,
T. Williams
Abstract:
Input/Output (I/O) performance is one of the key areas that need to be carefully examined to better support IT services. With the rapid development and deployment of virtualization technology, many essential business applications have been migrated to the virtualized platform due to reduced cost and improved agility. However, the impact of such a transition on I/O performance is not very well studied. In this research project, the authors investigated the disk write request performance on a virtual storage interface and on a physical storage interface. Specifically, the study aimed to identify whether a virtual SCSI disk controller can process 4KB and 32KB I/O write requests faster than a standard physical IDE controller. The experiments of this study were constructed to best emulate real-world IT configurations, and the results were carefully analyzed. They reveal that a virtual SCSI controller can process smaller write requests (4KB) faster than the physical IDE controller, but it is outperformed by its physical counterpart when the write requests are larger (32KB). This manuscript presents the details of this research along with recommendations for improving virtual I/O performance.
Submitted 8 February, 2025;
originally announced February 2025.
-
Automatically Detecting Heterogeneous Bugs in High-Performance Computing Scientific Software
Authors:
Matthew Davis,
Aakash Kulkarni,
Ziyan Chen,
Yunhan Qiao,
Christopher Terrazas,
Manish Motwani
Abstract:
Scientific advancements rely on high-performance computing (HPC) applications that model real-world phenomena through simulations. These applications process vast amounts of data on specialized accelerators (e.g., GPUs) using special libraries. Heterogeneous bugs occur in these applications when managing data movement across different platforms, such as CPUs and GPUs, leading to divergent behavior when using heterogeneous platforms compared to using only CPUs. Existing software testing techniques often fail to detect such bugs because they either do not account for platform-specific characteristics or target only specific platforms. To address this problem, we present HeteroBugDetect, an automated approach to detect platform-dependent heterogeneous bugs in HPC scientific applications. HeteroBugDetect combines natural-language processing, off-target testing, custom fuzzing, and differential testing to provide an end-to-end solution for detecting platform-specific bugs in scientific applications. We evaluate HeteroBugDetect on LAMMPS, a molecular dynamics simulator, where it detected multiple heterogeneous bugs, enhancing its reliability across diverse HPC environments.
Submitted 16 January, 2025;
originally announced January 2025.
-
Automated external cervical resorption segmentation in cone-beam CT using local texture features
Authors:
Sadhana Ravikumar,
Asma A. Khan,
Matthew C. Davis,
Beatriz Paniagua
Abstract:
External cervical resorption (ECR) is a resorptive process affecting teeth. While in some patients active resorption ceases and is replaced by osseous tissue, in other cases the resorption progresses and ultimately results in tooth loss. For proper ECR assessment, cone-beam computed tomography (CBCT) is the recommended imaging modality, enabling a 3-D characterization of these lesions. While it is possible to manually identify and measure ECR in CBCT scans, this process can be time-intensive and highly subject to human error. Therefore, there is an urgent need to develop an automated method to identify and quantify the severity of ECR using CBCT. Here, we present a method for ECR lesion segmentation that is based on automatic, binary classification of locally extracted voxel-wise texture features. We evaluate our method on 6 longitudinal CBCT datasets and show that certain texture features can be used to accurately detect subtle CBCT signal changes due to ECR. We also present preliminary analyses clustering texture features within a lesion to stratify the defects and identify patterns indicative of calcification. These methods are important steps in developing prognostic biomarkers to predict whether ECR will continue to progress or cease, ultimately informing treatment decisions.
Submitted 9 January, 2025;
originally announced January 2025.
-
The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset
Authors:
Jeffrey D. Rudie,
Hui-Ming Lin,
Robyn L. Ball,
Sabeena Jalal,
Luciano M. Prevedello,
Savvas Nicolaou,
Brett S. Marinelli,
Adam E. Flanders,
Kirti Magudia,
George Shih,
Melissa A. Davis,
John Mongan,
Peter D. Chang,
Ferco H. Berger,
Sebastiaan Hermans,
Meng Law,
Tyler Richards,
Jan-Peter Grunz,
Andreas Steven Kunz,
Shobhit Mathur,
Sandro Galea-Soler,
Andrew D. Chung,
Saif Afat,
Chin-Chi Kuo,
Layal Aweidah
, et al. (15 additional authors not shown)
Abstract:
The RSNA Abdominal Traumatic Injury CT (RATIC) dataset is the largest publicly available collection of adult abdominal CT studies annotated for traumatic injuries. This dataset includes 4,274 studies from 23 institutions across 14 countries. The dataset is freely available for non-commercial use via Kaggle at https://www.kaggle.com/competitions/rsna-2023-abdominal-trauma-detection. Created for the RSNA 2023 Abdominal Trauma Detection competition, the dataset encourages the development of advanced machine learning models for detecting abdominal injuries on CT scans. The dataset encompasses detection and classification of traumatic injuries across multiple organs, including the liver, spleen, kidneys, bowel, and mesentery. Annotations were created by expert radiologists from the American Society of Emergency Radiology (ASER) and Society of Abdominal Radiology (SAR). The dataset is annotated at multiple levels, including the presence of injuries in three solid organs with injury grading, image-level annotations for active extravasations and bowel injury, and voxelwise segmentations of each of the potentially injured organs. With the release of this dataset, we hope to facilitate research and development in machine learning and abdominal trauma that can lead to improved patient care and outcomes.
Submitted 29 May, 2024;
originally announced May 2024.
-
Twins in rotational spectroscopy: Does a rotational spectrum uniquely identify a molecule?
Authors:
Marcus Schwarting,
Nathan A. Seifert,
Michael J. Davis,
Ben Blaiszik,
Ian Foster,
Kirill Prozument
Abstract:
Rotational spectroscopy is the most accurate method for determining structures of molecules in the gas phase. It is often assumed that a rotational spectrum is a unique "fingerprint" of a molecule. The availability of large molecular databases and the development of artificial intelligence methods for spectroscopy make the testing of this assumption timely. In this paper, we pose the determination of molecular structures from rotational spectra as an inverse problem. Within this framework, we adopt a funnel-based approach to search for molecular twins, that is, two or more molecules that have similar rotational spectra but distinctly different molecular structures. We demonstrate that there are twins within standard levels of computational accuracy by generating rotational constants for many molecules from several large molecular databases, indicating that the inverse problem is ill-posed. However, some twins can be distinguished by increasing the accuracy of the theoretical methods or by performing additional experiments.
Submitted 5 April, 2024;
originally announced April 2024.
-
Contradicted by the Brain: Predicting Individual and Group Preferences via Brain-Computer Interfacing
Authors:
Keith M. Davis III,
Michiel Spapé,
Tuukka Ruotsalo
Abstract:
We investigate inferring individual preferences and the contradiction of individual preferences with group preferences through direct measurement of the brain. We report an experiment where brain activity collected from 31 participants produced in response to viewing images is associated with their self-reported preferences. First, we show that brain responses present a graded response to preferences, and that brain responses alone can be used to train classifiers that reliably estimate preferences. Second, we show that brain responses reveal additional preference information that correlates with group preference, even when participants self-reported having no such preference. Our analysis of brain responses carries significant implications for researchers in general, as it suggests an individual's explicit preferences are not always aligned with the preferences inferred from their brain responses. These findings call into question the reliability of explicit and behavioral signals. They also imply that additional, multimodal sources of information may be necessary to infer reliable preference information.
Submitted 15 December, 2023;
originally announced December 2023.
-
Towards Distributed Quantum Computing by Qubit and Gate Graph Partitioning Techniques
Authors:
Marc Grau Davis,
Joaquin Chung,
Dirk Englund,
Rajkumar Kettimuthu
Abstract:
Distributed quantum computing is motivated by the difficulty in building large-scale, individual quantum computers. To solve that problem, a large quantum circuit is partitioned and distributed to small quantum computers for execution. Partitions running on different quantum computers share quantum information using entangled Bell pairs. However, entanglement generation and purification introduce both a runtime and a memory overhead on distributed quantum computing. In this paper we study that trade-off by proposing two techniques for partitioning large quantum circuits and distributing them to small quantum computers. Our techniques map a quantum circuit to a graph representation. We study two approaches: one that considers only gate teleportation, and another that considers both gate and state teleportation to achieve the distributed execution. Then we apply the METIS graph partitioning algorithm to obtain the partitions and the number of entanglement requests between them. We use the SeQUeNCe quantum communication simulator to measure the time required for generating all the entanglements required to execute the distributed circuit. We find that the best partitioning technique will depend on the specific circuit of interest.
Submitted 5 October, 2023;
originally announced October 2023.
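As a rough illustration of the partition-then-count step this abstract describes, the sketch below counts the edges that cross a given partition of a small graph. It rests on assumptions the listing does not confirm: the four-node graph and the two-way assignment are hypothetical, the partition is written by hand rather than produced by METIS, and treating each cut edge as one entanglement request is a simplification rather than the authors' stated cost model.

```python
from typing import Dict, Iterable, Tuple


def count_cut_edges(edges: Iterable[Tuple[int, int]],
                    assignment: Dict[int, int]) -> int:
    """Count edges whose endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Hypothetical graph: nodes stand for qubits, edges for two-qubit gates.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Hypothetical hand-written partition: nodes 0 and 1 on machine 0,
# nodes 2 and 3 on machine 1. A real workflow would obtain this
# assignment from a graph partitioner.
assignment = {0: 0, 1: 0, 2: 1, 3: 1}

cut = count_cut_edges(edges, assignment)
print(f"edges crossing the partition: {cut}")
```

A partitioner would search for the assignment that keeps this count small while balancing the number of nodes per machine.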
-
Exploring Gender-Based Toxic Speech on Twitter in Context of the #MeToo movement: A Mixed Methods Approach
Authors:
Sayak Saha Roy,
Ohad Gilbar,
Christina Palantza,
Maxine Davis,
Shirin Nilizadeh
Abstract:
The #MeToo movement has catalyzed widespread public discourse surrounding sexual harassment and assault, empowering survivors to share their stories and holding perpetrators accountable. While the movement has had a substantial and largely positive influence, this study aims to examine the potential negative consequences in the form of increased hostility against women and men on the social media platform Twitter. By analyzing tweets shared between October 2017 and January 2020 by more than 47.1k individuals who had either disclosed their own sexual abuse experiences on Twitter or engaged in discussions about the movement, we identify the overall increase in gender-based hostility towards both women and men since the start of the movement. We also monitor 16 pivotal real-life events that shaped the #MeToo movement to identify how these events may have amplified negative discussions targeting the opposite gender on Twitter. Furthermore, we conduct a thematic content analysis of a subset of gender-based hostile tweets, which helps us identify recurring themes and underlying motivations driving the expressions of anger and resentment from both men and women concerning the #MeToo movement. This study highlights the need for a nuanced understanding of the impact of social movements on online discourse and underscores the importance of addressing gender-based hostility in the digital sphere.
Submitted 24 August, 2023;
originally announced August 2023.
-
Geodesic complexity of a cube
Authors:
Donald M. Davis
Abstract:
The topological (resp. geodesic) complexity of a topological (resp. metric) space is roughly the smallest number of continuous rules required to choose paths (resp. shortest paths) between any points of the space. We prove that the geodesic complexity of a cube exceeds its topological complexity by exactly 2. The proof involves a careful analysis of cut loci of the cube.
Submitted 8 August, 2023;
originally announced August 2023.
-
Frame Size Optimization Using a Machine Learning Approach in WLAN Downlink MU-MIMO Channel
Authors:
Lemlem Kassa,
Jianhua Deng,
Mark Davis,
Jingye Cai
Abstract:
The IEEE 802.11ac/n standards introduced frame aggregation technology to accommodate the growing traffic demand and to improve transmission efficiency and channel utilization. This is achieved by allowing many packets to be aggregated per transmission, which yields a significant enhancement in the throughput performance of WLAN. However, it is difficult to efficiently utilize the benefits of frame aggregation in the downlink MU-MIMO channels, as stations have heterogeneous transmission demands and data transmission rates. As a result, wasted space channel time occurs, which degrades transmission efficiency. Existing studies have proposed different approaches to address these challenges. However, most of these approaches did not consider a machine-learning-based optimization solution. The main contribution of this paper is to propose a machine-learning-based frame size optimization solution to maximize the system throughput of WLAN in the downlink MU-MIMO channel. In this approach, the Access Point (AP) measures the maximum system throughput and collects frame size-system throughput patterns, which contain knowledge about the effects of traffic patterns, channel conditions, and the number of stations (STAs). Based on these patterns, our approach uses a neural network to model the system throughput as a function of the system frame size. After training the neural network, we obtain the gradient information to adjust the frame size. The performance of the proposed machine learning (ML) approach is evaluated against the FIFO aggregation algorithm under the effects of heterogeneous traffic patterns for VoIP and video applications, channel conditions, and number of stations.
Submitted 20 September, 2022;
originally announced September 2022.
-
Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT
Authors:
Aparna Elangovan,
Yuan Li,
Douglas E. V. Pires,
Melissa J. Davis,
Karin Verspoor
Abstract:
Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the use of the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filtered ~5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, which highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
Submitted 6 January, 2022;
originally announced January 2022.
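The test-set figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the reported F1-micro is the harmonic mean of the reported precision and recall, which holds for micro-averaged scores; the function and variable names are illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract, in percent.
reported_precision = 58.1
reported_recall = 32.1
reported_f1 = 41.3

f1 = f1_score(reported_precision, reported_recall)
print(f"recomputed F1: {f1:.2f} (reported: {reported_f1})")
```

The recomputed value lands within about 0.1 of the reported 41.3; a small gap of that size is expected because the quoted precision and recall are themselves rounded.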
-
Adverse Media Mining for KYC and ESG Compliance
Authors:
Rupinder Paul Khandpur,
Albert Aristotle Nanda,
Mathew Davis,
Chen Li,
Daulet Nurmanbetov,
Sankalp Gaur,
Ashit Talukder
Abstract:
In recent years, institutions operating in the global market economy have faced growing risks stemming from non-financial risk factors such as cyber, third-party, and reputational risks, which outweigh the traditional risks of credit and liquidity. Adverse media or negative news screening is crucial for the identification of such non-financial risks. Typical tools for screening are not real-time, involve manual searches, and require labor-intensive monitoring of information sources. Moreover, they are costly to keep up to date with complex regulatory requirements and the institution's evolving risk appetite.
In this extended abstract, we present an automated system to conduct both real-time and batch search of adverse media for users' queries (person or organization entities) using news and other open-source, unstructured sources of information. Our scalable, machine-learning driven approach to high-precision, adverse news filtering is based on four perspectives - relevance to risk domains, search query (entity) relevance, adverse sentiment analysis, and risk encoding. With the help of model evaluations and case studies, we summarize the performance of our deployed application.
Submitted 21 October, 2021;
originally announced October 2021.
-
LEAP: Scaling Numerical Optimization Based Synthesis Using an Incremental Approach
Authors:
Ethan Smith,
Marc G. Davis,
Jeffrey Larson,
Ed Younis,
Costin Iancu,
Wim Lavrijsen
Abstract:
While showing great promise, circuit synthesis techniques that combine numerical optimization with search over circuit structures face scalability challenges due to a large number of parameters, exponential search spaces, and complex objective functions. The LEAP algorithm improves scaling across these dimensions using iterative circuit synthesis, incremental re-optimization, dimensionality reduction, and improved numerical optimization. LEAP draws on the design of the optimal synthesis algorithm QSearch by extending it with an incremental approach to determine constant prefix solutions for a circuit. By narrowing the search space, LEAP improves scalability from four to six qubit circuits. LEAP was evaluated with known quantum circuits such as QFT and physical simulation circuits like the VQE, TFIM, and QITE. LEAP can compile four qubit unitaries up to $59\times$ faster than QSearch and five and six qubit unitaries with up to $1.2\times$ fewer CNOTs compared to the QFAST package. LEAP can reduce the CNOT count by up to $36\times$, or $7\times$ on average, compared to the CQC Tket compiler. Despite its heuristics, LEAP has generated optimal circuits for many test cases with a priori known solutions. The techniques introduced by LEAP are applicable to other numerical-optimization-based synthesis approaches.
Submitted 17 December, 2021; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera
Authors:
Franziska Mueller,
Micah Davis,
Florian Bernard,
Oleksandr Sotnychenko,
Mickeal Verschoor,
Miguel A. Otaduy,
Dan Casas,
Christian Theobalt
Abstract:
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach is the first two-hand tracking solution that combines an extensive list of favorable properties, namely it is marker-less, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape. In order to achieve this, we embed a recent parametric hand pose and shape model and a dense correspondence predictor based on a deep neural network into a suitable energy minimization framework. For training the correspondence prediction network, we synthesize a two-hand dataset based on physical simulations that includes both hand pose and shape annotations while at the same time avoiding inter-hand penetrations. To achieve real-time rates, we phrase the model fitting in terms of a nonlinear least-squares problem so that the energy can be optimized based on a highly efficient GPU-based Gauss-Newton optimizer. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work, including tight two-hand grasps, significant inter-hand occlusions, and gesture interaction.
Submitted 15 June, 2021;
originally announced June 2021.
-
Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts
Authors:
Aparna Elangovan,
Melissa Davis,
Karin Verspoor
Abstract:
Motivation: Protein-protein interactions (PPI) are critical to the function of proteins in both normal and diseased cells, and many critical protein functions are mediated by interactions. Knowledge of the nature of these interactions is important for the construction of networks to analyse biological data. However, only a small percentage of PPIs captured in protein interaction databases have annotations of function available, e.g. only 4% of PPIs are functionally annotated in the IntAct database. Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts.
Method: We create a weakly supervised dataset from the IntAct PPI database containing interacting protein pairs with annotated function and associated abstracts from the PubMed database. We apply a state-of-the-art deep learning technique for biomedical natural language processing tasks, BioBERT, to build a model - dubbed PPI-BioBERT - for identifying the function of PPIs. In order to extract high quality PPI functions at large scale, we use an ensemble of PPI-BioBERT models to improve uncertainty estimation and apply an interaction type-specific threshold to counteract the effects of variations in the number of training samples per interaction type.
Results: We scan 18 million PubMed abstracts to automatically identify 3253 new typed PPIs, including phosphorylation and acetylation interactions, with an overall precision of 46% (87% for acetylation) based on a human-reviewed sample. This work demonstrates that analysis of biomedical abstracts for PPI function extraction is a feasible approach to substantially increasing the number of interactions annotated with function captured in online databases.
Submitted 6 January, 2022; v1 submitted 19 August, 2020;
originally announced August 2020.
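The Method paragraph of this entry combines an ensemble of classifiers with an interaction type-specific threshold. A minimal sketch of that idea follows, assuming each ensemble member returns a probability per interaction type; the function name, the labels, the probabilities and the threshold values are hypothetical and are not taken from PPI-BioBERT.

```python
from statistics import mean


def ensemble_predict(member_probs, thresholds):
    """Average per-class probabilities over ensemble members and accept the
    top class only if it clears that class's own threshold."""
    labels = member_probs[0].keys()
    avg = {label: mean(m[label] for m in member_probs) for label in labels}
    best = max(avg, key=avg.get)
    if avg[best] >= thresholds[best]:
        return best, avg[best]
    return None, avg[best]  # below threshold: abstain rather than guess


# Hypothetical outputs of three ensemble members for one protein pair.
members = [
    {"phosphorylation": 0.90, "acetylation": 0.05, "other": 0.05},
    {"phosphorylation": 0.80, "acetylation": 0.10, "other": 0.10},
    {"phosphorylation": 0.85, "acetylation": 0.05, "other": 0.10},
]
# Hypothetical per-type thresholds; a rarer type could be given a stricter one.
thresholds = {"phosphorylation": 0.80, "acetylation": 0.95, "other": 0.99}

label, score = ensemble_predict(members, thresholds)
strict_label, _ = ensemble_predict(members, {**thresholds, "phosphorylation": 0.90})
```

Abstaining when the averaged score falls below the class threshold trades recall for precision, which matches the stated goal of extracting high-quality interactions at scale.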
-
Open Source Software Sustainability Models: Initial White Paper from the Informatics Technology for Cancer Research Sustainability and Industry Partnership Work Group
Authors:
Y. Ye,
R. D. Boyce,
M. K. Davis,
K. Elliston,
C. Davatzikos,
A. Fedorov,
J. C. Fillion-Robin,
I. Foster,
J. Gilbertson,
M. Heiskanen,
J. Klemm,
A. Lasso,
J. V. Miller,
M. Morgan,
S. Pieper,
B. Raumann,
B. Sarachan,
G. Savova,
J. C. Silverstein,
D. Taylor,
J. Zelnis,
G. Q. Zhang,
M. J. Becich
Abstract:
The Sustainability and Industry Partnership Work Group (SIP-WG) is a part of the National Cancer Institute Informatics Technology for Cancer Research (ITCR) program. The charter of the SIP-WG is to investigate options of long-term sustainability of open source software (OSS) developed by the ITCR, in part by developing a collection of business model archetypes that can serve as sustainability plans for ITCR OSS development initiatives. The workgroup assembled models from the ITCR program, from other studies, and via engagement of its extensive network of relationships with other organizations (e.g., Chan Zuckerberg Initiative, Open Source Initiative and Software Sustainability Institute). This article reviews existing sustainability models and describes ten OSS use cases disseminated by the SIP-WG and others, and highlights five essential attributes (alignment with unmet scientific needs, dedicated development team, vibrant user community, feasible licensing model, and sustainable financial model) to assist academic software developers in achieving best practice in software sustainability.
Submitted 1 January, 2020; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Heuristics for Quantum Compiling with a Continuous Gate Set
Authors:
Marc Grau Davis,
Ethan Smith,
Ana Tudor,
Koushik Sen,
Irfan Siddiqi,
Costin Iancu
Abstract:
We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As accurate CNOT gates are hard to achieve in the foreseeable Noisy Intermediate-Scale Quantum (NISQ) device era, our A*-inspired algorithm attempts to minimize their count, while accounting for connectivity. We discuss the search strategy together with metrics to expand the solution frontier. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in the literature and match or exceed hand-tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show 2.4x average (up to 5.3x) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set, while obtaining similar quality results. We believe that empirical tools like ours can facilitate algorithmic exploration and gate set discovery for quantum processor designers, and provide useful optimization blocks within the quantum compilation tool-chain.
Submitted 5 December, 2019;
originally announced December 2019.
-
Transcriptional Response of SK-N-AS Cells to Methamidophos
Authors:
Akos Vertes,
Albert-Baskar Arul,
Peter Avar,
Andrew R. Korte,
Lida Parvin,
Ziad J. Sahab,
Deborah I. Bunin,
Merrill Knapp,
Denise Nishita,
Andrew Poggio,
Mark-Oliver Stehr,
Carolyn L. Talcott,
Brian M. Davis,
Christine A. Morton,
Christopher J. Sevinsky,
Maria I. Zavodszky
Abstract:
Transcriptomics response of SK-N-AS cells to methamidophos (an acetylcholine esterase inhibitor) exposure was measured at 10 time points between 0.5 and 48 h. The data was analyzed using a combination of traditional statistical methods and novel machine learning algorithms for detecting anomalous behavior and inferring causal relations between time profiles. We identified several processes that appeared to be upregulated in cells treated with methamidophos, including unfolded protein response, response to cAMP, calcium ion response, and cell-cell signaling. The data confirmed the expected consequence of acetylcholine buildup. In addition, transcripts with potentially key roles were identified and causal networks relating these transcripts were inferred using two different computational methods: Siamese convolutional networks and time warp causal inference. Two types of anomaly detection algorithms, one based on Autoencoders and the other one based on Generative Adversarial Networks (GANs), were applied to narrow down the set of relevant transcripts.
Submitted 10 August, 2019;
originally announced August 2019.
-
Learning Causality: Synthesis of Large-Scale Causal Networks from High-Dimensional Time Series Data
Authors:
Mark-Oliver Stehr,
Peter Avar,
Andrew R. Korte,
Lida Parvin,
Ziad J. Sahab,
Deborah I. Bunin,
Merrill Knapp,
Denise Nishita,
Andrew Poggio,
Carolyn L. Talcott,
Brian M. Davis,
Christine A. Morton,
Christopher J. Sevinsky,
Maria I. Zavodszky,
Akos Vertes
Abstract:
There is an abundance of complex dynamic systems that are critical to our daily lives and our society but that are hardly understood, and even with today's possibilities to sense and collect large amounts of experimental data, they are so complex and continuously evolving that it is unlikely that their dynamics will ever be understood in full detail. Nevertheless, through computational tools we can try to make the best possible use of the current technologies and available data. We believe that the most useful models will have to take into account the imbalance between system complexity and available data in the context of limited knowledge or multiple hypotheses. The complex system of biological cells is a prime example of such a system that is studied in systems biology and has motivated the methods presented in this paper. They were developed as part of the DARPA Rapid Threat Assessment (RTA) program, which is concerned with understanding of the mechanism of action (MoA) of toxins or drugs affecting human cells. Using a combination of Gaussian processes and abstract network modeling, we present three fundamentally different machine-learning-based approaches to learn causal relations and synthesize causal networks from high-dimensional time series data. While other types of data are available and have been analyzed and integrated in our RTA work, we focus on transcriptomics (that is gene expression) data obtained from high-throughput microarray experiments in this paper to illustrate capabilities and limitations of our algorithms. Our algorithms make different but overall relatively few biological assumptions, so that they are applicable to other types of biological data and potentially even to other complex systems that exhibit high dimensionality but are not of biological nature.
Submitted 6 May, 2019;
originally announced May 2019.
-
Security Attacks on Smart Grid Scheduling and Their Defences: A Game-Theoretic Approach
Authors:
Matthias Pilz,
Fariborz Baghaei Naeini,
Ketil Grammont,
Coline Smagghe,
Mastaneh Davis,
Jean-Christophe Nebel,
Luluwah Al-Fagih,
Eckhard Pfluegel
Abstract:
The introduction of advanced communication infrastructure into the power grid raises a plethora of new opportunities to tackle climate change. This paper is concerned with the security of energy management systems which are expected to be implemented in the future smart grid. The existence of a novel class of false data injection attacks that are based on modifying forecasted demand data is demonstrated, and the impact of the attacks on a typical system's parameters is identified, using a simulated scenario. Monitoring strategies that the utility company may employ in order to detect the attacks are proposed and a game-theoretic approach is used to support the utility company's decision-making process for the allocation of their defence resources. Informed by these findings, a generic security game is devised and solved, revealing the existence of several Nash Equilibrium strategies. The practical outcomes of these results for the utility company are discussed in detail and a proposal is made, suggesting how the generic model may be applied to other scenarios.
Submitted 17 October, 2018;
originally announced October 2018.
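To make the notion of a Nash equilibrium in an attacker/defender game concrete, here is a small sketch on a toy 2x2 game. The payoff numbers, the strategy labels and both function names are hypothetical and are not the security game from the paper; the toy game is chosen so that no pure-strategy equilibrium exists and the players must randomise.

```python
from fractions import Fraction


def pure_nash(attacker, defender):
    """Return all (row, col) cells where neither player gains by deviating alone."""
    rows, cols = range(len(attacker)), range(len(attacker[0]))
    return [
        (r, c)
        for r in rows
        for c in cols
        if all(attacker[r][c] >= attacker[r2][c] for r2 in rows)
        and all(defender[r][c] >= defender[r][c2] for c2 in cols)
    ]


def mixed_nash_2x2(attacker, defender):
    """Interior mixed equilibrium of a 2x2 game via the indifference conditions.

    Returns (p, q): p is the probability that the row player picks row 0,
    q the probability that the column player picks column 0.
    Assumes both denominators are non-zero.
    """
    a, b = attacker, defender
    q = Fraction(a[1][1] - a[0][1], a[0][0] - a[0][1] - a[1][0] + a[1][1])
    p = Fraction(b[1][1] - b[1][0], b[0][0] - b[0][1] - b[1][0] + b[1][1])
    return p, q


# Rows: attacker plays (attack, no attack). Columns: defender plays (monitor, no monitor).
attacker_payoff = [[-5, 4], [0, 0]]   # hypothetical values
defender_payoff = [[2, -6], [-1, 0]]  # hypothetical values

pure = pure_nash(attacker_payoff, defender_payoff)
p_attack, q_monitor = mixed_nash_2x2(attacker_payoff, defender_payoff)
```

In this toy game the defender monitors with probability 4/9, which makes the attacker indifferent between attacking and not attacking; the attacker attacks with probability 1/9, which makes the defender indifferent between monitoring and not monitoring.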
-
Optical Network Virtualisation using Multi-technology Monitoring and SDN-enabled Optical Transceiver
Authors:
Yanni Ou,
Matthew Davis,
Alejandro Aguado,
Fanchao Meng,
Reza Nejabati,
Dimitra Simeonidou
Abstract:
We introduce real-time multi-technology transport layer monitoring to facilitate the coordinated virtualisation of optical and Ethernet networks supported by optical virtualise-able transceivers (V-BVT). A monitoring and network resource configuration scheme is proposed to include hardware monitoring in both the Ethernet and optical layers. The scheme depicts the data and control interactions among multiple network layers in a software-defined networking (SDN) context, as well as the application that analyses the monitored data obtained from the database. We also present a re-configuration algorithm to adaptively modify the composition of virtual optical networks based on two criteria. The proposed monitoring scheme is experimentally demonstrated with OpenFlow (OF) extensions for a holistic (re-)configuration across both layers in Ethernet switches and V-BVTs.
Submitted 15 January, 2018; v1 submitted 12 October, 2017;
originally announced October 2017.
-
Cited Half-Life of the Journal Literature
Authors:
Philip M. Davis,
Angela Cochran
Abstract:
Analyzing 13,455 journals listed in the Journal Citation Report (Thomson Reuters) from 1997 through 2013, we report that the mean cited half-life of the scholarly literature is 6.5 years and growing at a rate of 0.13 years per annum. Focusing on a subset of journals (N=4,937) for which we have a continuous series of half-life observations, 209 of 229 (91%) subject categories experienced increasing cited half-lives. Contrary to the overall trend, engineering and chemistry journals experienced declining cited half-lives. Last, as journals attracted more citations, a larger proportion of them were directed toward older papers. The trend to cite older papers is not fully explained by technology (digital publishing, search and retrieval, etc.), but may be the result of a structural shift to fund incremental and applied research over fundamental science.
Submitted 29 April, 2015; v1 submitted 28 April, 2015;
originally announced April 2015.
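For readers unfamiliar with the metric in this entry, a cited half-life is the median age of the items a journal had cited to it in a given year. The sketch below computes it from a histogram of citation ages with linear interpolation inside the year where the cumulative count crosses 50%. It is a simplified illustration; the function name and the sample counts are hypothetical, and the exact procedure used by the Journal Citation Report may differ.

```python
def cited_half_life(citations_by_age):
    """Median age of cited items, in years.

    citations_by_age[i] is the number of citations received this year by
    items that are i + 1 years old (index 0 = the most recent year).
    """
    total = sum(citations_by_age)
    if total == 0:
        raise ValueError("no citations to analyse")
    half = total / 2
    cumulative = 0
    for age, count in enumerate(citations_by_age, start=1):
        if cumulative + count >= half:
            # Interpolate within the year in which the 50% mark is crossed.
            return (age - 1) + (half - cumulative) / count
        cumulative += count


# Hypothetical age histogram: 100 citations spread over six publication years.
half_life = cited_half_life([10, 20, 30, 20, 10, 10])
```

With these sample counts, 30 citations fall in the first two years and the 50th falls two thirds of the way through the third year, so the result is about 2.67 years.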
-
Energy-Throughput Trade-offs in a Wireless Sensor Network with Mobile Relay
Authors:
Guanghua Zhu,
Linda M. Davis,
Terence Chan
Abstract:
In this paper we analyze the trade-offs between energy and throughput for links in a wireless sensor network. Our application of interest is one in which a number of low-powered sensors need to wirelessly communicate their measurements to a communications sink, or destination node, for communication to a central processor. We focus on one particular sensor source, and consider the case where the distance to the destination is beyond the peak power of the source. A relay node is required. Transmission energy of the sensor and the relay can be adjusted to minimize the total energy for a given throughput of the connection from sensor source to destination. We introduce a bounded random walk model for movement of the relay between the sensor and destination nodes, and characterize the total transmission energy and throughput performance using Markov steady state analysis. Based on the trade-offs between total energy and throughput we propose a new time-sharing protocol to exploit the movement of the relay to reduce the total energy. We demonstrate the effectiveness of time-sharing for minimizing the total energy consumption while achieving the throughput requirement. We then show that the time-sharing scheme is more energy efficient than the popular sleep mode scheme.
Submitted 23 March, 2014;
originally announced March 2014.
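The bounded random walk and Markov steady-state analysis mentioned in this entry can be sketched as follows. The relay moves over a line of discrete positions and a blocked move at either end leaves it in place. The step probabilities, the number of positions and the squared-distance energy model are all assumptions made for illustration and are not taken from the paper.

```python
def bounded_walk_matrix(n_positions, p_left, p_right):
    """Transition matrix of a random walk on positions 0..n-1.

    A move that would leave the interval is blocked, so the relay stays put.
    """
    n = n_positions
    p_stay = 1.0 - p_left - p_right
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][max(i - 1, 0)] += p_left
        P[i][min(i + 1, n - 1)] += p_right
        P[i][i] += p_stay
    return P


def steady_state(P, iterations=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


n = 5
pi = steady_state(bounded_walk_matrix(n, p_left=0.2, p_right=0.4))

# Hypothetical energy model: each hop costs the square of its length, with the
# relay at position i being i + 1 units from the sensor and n - i from the sink.
expected_energy = sum(pi[i] * ((i + 1) ** 2 + (n - i) ** 2) for i in range(n))
```

Because this walk is a birth-death chain, detailed balance gives pi[i + 1] / pi[i] = p_right / p_left, which is a convenient check on the numerical result.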
-
Best Practices for Scientific Computing
Authors:
Greg Wilson,
D. A. Aruliah,
C. Titus Brown,
Neil P. Chue Hong,
Matt Davis,
Richard T. Guy,
Steven H. D. Haddock,
Katy Huff,
Ian M. Mitchell,
Mark Plumbley,
Ben Waugh,
Ethan P. White,
Paul Wilson
Abstract:
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.
Submitted 26 September, 2013; v1 submitted 30 September, 2012;
originally announced October 2012.
-
Studies on access: a review
Authors:
Philip M. Davis
Abstract:
A review of the empirical literature on access to scholarly information. This review focuses on surveys of authors, article download and citation analysis.
Submitted 19 December, 2009;
originally announced December 2009.
-
On the Capacity of the Discrete-Time Channel with Uniform Output Quantization
Authors:
Yiyue Wu,
Linda M. Davis,
Robert Calderbank
Abstract:
This paper provides new insight into the classical problem of determining both the capacity of the discrete-time channel with uniform output quantization and the capacity-achieving input distribution. It builds on earlier work by Gallager and Witsenhausen to provide a detailed analysis of two particular quantization schemes. The first is saturation quantization where overflows are mapped to the nearest quantization bin, and the second is wrapping quantization where overflows are mapped to the nearest quantization bin after reduction by some modulus. Both the capacity of wrapping quantization and the capacity-achieving input distribution are determined. When the additive noise is Gaussian and relatively small, the capacity of saturation quantization is shown to be bounded below by that of wrapping quantization. In the limit of arbitrarily many uniform quantization levels, it is shown that the difference between the upper and lower bounds on capacity given by Ihara is only 0.26 bits.
Submitted 16 January, 2009;
originally announced January 2009.
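The two overflow rules described in this entry can be illustrated with a small uniform quantizer. This is a sketch under assumed conventions: bins are indexed by integers from `lowest` to `highest`, the step size is a free parameter, and the function names are hypothetical.

```python
import math


def bin_index(x, step):
    """Index of the nearest uniform quantization bin (ties round upward)."""
    return math.floor(x / step + 0.5)


def saturate(x, step, lowest, highest):
    """Saturation: an overflow is clamped to the nearest in-range bin."""
    return min(max(bin_index(x, step), lowest), highest)


def wrap(x, step, lowest, highest):
    """Wrapping: the bin index is reduced modulo the number of bins."""
    n_bins = highest - lowest + 1
    return lowest + (bin_index(x, step) - lowest) % n_bins


# A 2-bit quantizer with bins -2, -1, 0, 1 and unit step size.
LOW, HIGH, STEP = -2, 1, 1.0
results = {x: (saturate(x, STEP, LOW, HIGH), wrap(x, STEP, LOW, HIGH))
           for x in (0.4, 3.2, -2.6)}
```

An in-range input such as 0.4 maps to bin 0 under both rules. The overflow 3.2 saturates to bin 1 but wraps to bin -1, and -2.6 saturates to bin -2 but wraps to bin 1.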
-
Author-choice open access publishing in the biological and medical literature: a citation analysis
Authors:
Philip M. Davis
Abstract:
In this article, we analyze the citations to articles published in 11 biological and medical journals from 2003 to 2007 that employ author-choice open access models. Controlling for known explanatory predictors of citations, only 2 of the 11 journals show positive and significant open access effects. Analyzing all journals together, we report a small but significant increase in article citations of 17%. In addition, there is strong evidence to suggest that the open access advantage is declining by about 7% per year, from 32% in 2004 to 11% in 2007.
Submitted 12 December, 2008; v1 submitted 18 August, 2008;
originally announced August 2008.
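The reported rate of decline in this entry can be checked from the two endpoint figures, assuming the "7% per year" is meant in percentage points and the decline is treated as linear over the three-year span:

```latex
\[
\frac{32\% - 11\%}{2007 - 2004} \;=\; \frac{21}{3} \;=\; 7 \text{ percentage points per year}
\]
```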
-
Eigenfactor: Does the Principle of Repeated Improvement Result in Better Journal Impact Estimates than Raw Citation Counts?
Authors:
Philip M. Davis
Abstract:
Eigenfactor.org, a journal evaluation tool which uses an iterative algorithm to weight citations (similar to the PageRank algorithm used for Google), has been proposed as a more valid method for calculating the impact of journals. The purpose of this brief communication is to investigate whether the principle of repeated improvement provides different rankings of journals than does a simple unweighted citation count (the method used by ISI).
Submitted 27 October, 2008; v1 submitted 16 July, 2008;
originally announced July 2008.
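The contrast between iteratively weighted citations and raw citation counts in this entry can be sketched with a generic PageRank-style power iteration. This is not the Eigenfactor algorithm itself, which differs in details; the damping value, the function names and the three-journal citation matrix are hypothetical.

```python
def weighted_rank(citations, damping=0.85, iterations=100):
    """PageRank-style scores, where citations[i][j] counts citations from
    journal i to journal j. A citation is worth more when the citing
    journal itself has a high score."""
    n = len(citations)
    out_totals = [sum(row) for row in citations]
    score = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                if out_totals[i] == 0:
                    # A journal that cites nothing spreads its weight evenly.
                    share = 1.0 / n
                else:
                    share = citations[i][j] / out_totals[i]
                new[j] += damping * score[i] * share
        score = new
    return score


def raw_counts(citations):
    """Unweighted citation totals received by each journal."""
    n = len(citations)
    return [sum(citations[i][j] for i in range(n)) for j in range(n)]


# Hypothetical citation matrix for journals A, B and C (no self-citations).
matrix = [
    [0, 5, 1],
    [4, 0, 1],
    [6, 6, 0],
]
scores = weighted_rank(matrix)
counts = raw_counts(matrix)
```

Comparing the ordering of `scores` with the ordering of `counts` for a real citation matrix is the kind of ranking comparison the abstract describes.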
-
Informed Traders
Authors:
Dorje C. Brody,
Mark H. A. Davis,
Robyn L. Friedman,
Lane P. Hughston
Abstract:
An asymmetric information model is introduced for the situation in which there is a small agent who is more susceptible to the flow of information in the market than the general market participant, and who tries to implement strategies based on the additional information. In this model market participants have access to a stream of noisy information concerning the future return of an asset, whereas the informed trader has access to a further information source which is obscured by an additional noise that may be correlated with the market noise. The informed trader uses the extraneous information source to seek statistical arbitrage opportunities, while at the same time accommodating the additional risk. The amount of information available to the general market participant concerning the asset return is measured by the mutual information of the asset price and the associated cash flow. The worth of the additional information source is then measured in terms of the difference of mutual information between the general market participant and the informed trader. This difference is shown to be nonnegative when the signal-to-noise ratio of the information flow is known in advance. Explicit trading strategies leading to statistical arbitrage opportunities, taking advantage of the additional information, are constructed, illustrating how excess information can be translated into profit.
Submitted 17 November, 2008; v1 submitted 8 July, 2008;
originally announced July 2008.
-
Citation advantage of Open Access articles likely explained by quality differential and media effects
Authors:
Philip M. Davis
Abstract:
In a study of articles published in the Proceedings of the National Academy of Sciences, Gunther Eysenbach discovered a significant citation advantage for those articles made freely-available upon publication (Eysenbach 2006). While the author attempted to control for confounding factors that may have explained the citation differential, the study was unable to control for characteristics of the article that may have led some authors to pay the additional page charges ($1,000) for immediate OA status. OA articles published in PNAS were more than twice as likely to be featured on the front cover of the journal (3.3% vs. 1.4%), nearly twice as likely to be picked up by the media (15% vs. 8%) and when cited reached, on average, nearly twice as many news outlets as subscription-based articles (4.2 vs. 2.6). The citation advantage of Open Access articles in PNAS may likely be explained by a quality differential and the amplification of media effects.
Submitted 16 January, 2007;
originally announced January 2007.
-
Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles?
Authors:
Philip M. Davis,
Michael J. Fromerth
Abstract:
An analysis of 2,765 articles published in four math journals from 1997 to 2005 indicates that articles deposited in the arXiv received 35% more citations on average than non-deposited articles (an advantage of about 1.1 citations per article), and that this difference was most pronounced for highly-cited articles. Open Access, Early View, and Quality Differential were examined as three non-exclusive postulates for explaining the citation advantage. There was little support for a universal Open Access explanation, and no empirical support for Early View. There was some inferential support for a Quality Differential brought about by more highly-citable articles being deposited in the arXiv. In spite of their citation advantage, arXiv-deposited articles received 23% fewer downloads from the publisher's website (about 10 fewer downloads per article) in all but the most recent two years after publication. The data suggest that arXiv and the publisher's website may be fulfilling distinct functional needs of the reader.
Submitted 6 February, 2007; v1 submitted 14 March, 2006;
originally announced March 2006.
-
eJournal interface can influence usage statistics: implications for libraries, publishers, and Project COUNTER
Authors:
Philip M. Davis,
Jason S. Price
Abstract:
The design of a publisher's electronic interface can have a measurable effect on electronic journal usage statistics. A study of journal usage from six COUNTER-compliant publishers at thirty-two research institutions in the United States, the United Kingdom and Sweden indicates that the ratio of PDF to HTML views is not consistent across publisher interfaces, even after controlling for differences in publisher content. The number of fulltext downloads may be artificially inflated when publishers require users to view HTML versions before accessing PDF versions or when linking mechanisms, such as CrossRef, direct users to the full text, rather than the abstract, of each article. These results suggest that usage reports from COUNTER-compliant publishers are not directly comparable in their current form. One solution may be to modify publisher numbers with adjustment factors deemed to be representative of the benefit or disadvantage due to its interface. Standardization of some interface and linking protocols may obviate these differences and allow for more accurate cross-publisher comparisons.
Submitted 16 February, 2006;
originally announced February 2006.
-
Search Using N-gram Technique Based Statistical Analysis for Knowledge Extraction in Case Based Reasoning Systems
Authors:
M. N. Karthik,
Moshe Davis
Abstract:
Searching techniques for Case Based Reasoning systems involve extensive methods of elimination. In this paper, we look at a new method of arriving at the right solution by performing a series of transformations upon the data. These involve N-gram based comparison and deduction of the input data with the case data, using Morphemes and Phonemes as the deciding parameters. A similar technique for eliminating possible errors using a noise removal function is performed. The error tracking and elimination is performed through a statistical analysis of obtained data, where the entire data set is analyzed as sub-categories of various etymological derivatives. A probability analysis for the closest match is then performed, which yields the final expression. This final expression is referred to the Case Base. The output is redirected through an Expert System based on best possible match. The threshold for the match is customizable, and could be set by the Knowledge-Architect.
Submitted 2 July, 2004;
originally announced July 2004.
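The N-gram comparison and customizable match threshold described in this entry can be illustrated with a character-trigram sketch. It uses Jaccard similarity over character n-grams as a stand-in; the morpheme and phoneme handling, the noise removal and the statistical analysis from the paper are not modelled, and the function names, the sample case base and the threshold value are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of lower-cased character n-grams contained in the text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query, case_base, threshold=0.3, n=3):
    """Closest case and its score; the case is None below the threshold."""
    score, case = max((similarity(query, case, n), case) for case in case_base)
    return (case, score) if score >= threshold else (None, score)


# Hypothetical case base and a misspelled query.
cases = [
    "printer is not responding",
    "network cable unplugged",
    "monitor shows no image",
]
match, match_score = best_match("printr not respnding", cases)
no_match, _ = best_match("zzzz qqqq", cases)
```

Overlapping n-grams tolerate small spelling errors in the input, and raising `threshold` makes the lookup stricter at the cost of returning no case more often.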