-
Soft Sensing Transformer: Hundreds of Sensors are Worth a Single Word
Authors:
Chao Zhang,
Jaswanth Yella,
Yu Huang,
Xiaoye Qian,
Sergei Petrov,
Andrey Rzhetsky,
Sthitie Bom
Abstract:
With the rapid development of AI technology in recent years, there have been many studies with deep learning models in the soft sensing area. However, while the models have become more complex, the data sets remain limited: researchers are fitting million-parameter models with hundreds of data samples, which is insufficient to demonstrate the effectiveness of their models, and the models thus often fail to perform when implemented in industrial applications. To solve this long-lasting problem, we are providing large-scale, high-dimensional time series manufacturing sensor data from Seagate Technology to the public. We demonstrate the challenges and effectiveness of modeling industrial big data by a Soft Sensing Transformer model on these data sets. The transformer is used because it has outperformed state-of-the-art techniques in Natural Language Processing, and since then has also performed well in direct application to computer vision without the introduction of image-specific inductive biases. We observe the similarity of sentence structure to the sensor readings and process the multi-variable sensor readings in a time series in a manner similar to sentences in natural language. The high-dimensional time-series data is formatted into the same shape as embedded sentences and fed into the transformer model. The results show that the transformer model outperforms the benchmark models in the soft sensing field based on auto-encoder and long short-term memory (LSTM) models. To the best of our knowledge, we are the first team in academia or industry to benchmark the performance of the original transformer model with large-scale numerical soft sensing data.
Submitted 10 November, 2021;
originally announced November 2021.
-
Lightweight Mobile Automated Assistant-to-physician for Global Lower-resource Areas
Authors:
Chao Zhang,
Hanxin Zhang,
Atif Khan,
Ted Kim,
Olasubomi Omoleye,
Oluwamayomikun Abiona,
Amy Lehman,
Christopher O. Olopade,
Olufunmilayo I. Olopade,
Pedro Lopes,
Andrey Rzhetsky
Abstract:
Importance: Lower-resource areas in Africa and Asia face a unique set of healthcare challenges: the dual high burden of communicable and non-communicable diseases; a paucity of highly trained primary healthcare providers in both rural and densely populated urban areas; and a lack of reliable, inexpensive internet connections. Objective: To address these challenges, we designed an artificial intelligence assistant to help primary healthcare providers in lower-resource areas document demographic and medical sign/symptom data and to record and share diagnostic data in real time with a centralized database. Design: We trained our system using multiple data sets, including US-based electronic medical records (EMRs) and open-source medical literature, and developed an adaptive, general medical assistant system based on machine learning algorithms. Main Outcomes and Measures: The application collects basic information from patients and provides primary care providers with diagnosis and prescription suggestions. The application differs from existing systems in that it covers a wide range of common diseases, signs, and medications typical in lower-resource countries; the application works with or without an active internet connection. Results: We have built and implemented an adaptive learning system that assists trained primary care professionals by means of an Android smartphone application, which interacts with a central database and collects real-time data. The application has been tested by dozens of primary care providers. Conclusions and Relevance: Our application would provide primary healthcare providers in lower-resource areas with a tool that enables faster and more accurate documentation of medical encounters. This application could be leveraged to automatically populate local or national EMR systems.
Submitted 28 October, 2021;
originally announced October 2021.
-
Detecting signal from science: The structure of research communities and prior knowledge improves prediction of genetic regulatory experiments
Authors:
Alexander V. Belikov,
Andrey Rzhetsky,
James Evans
Abstract:
The explosive growth of scientists, scientific journals, articles and findings in recent years exponentially increases the difficulty scientists face in navigating prior knowledge. This challenge is exacerbated by uncertainty about the reproducibility of published findings. The availability of massive digital archives, machine reading and extraction tools on the one hand, and automated high-throughput experiments on the other, allow us to evaluate these challenges at scale and identify novel opportunities for accelerating scientific advance. Here we demonstrate a Bayesian calculus that enables the positive prediction of robust, replicable scientific claims with findings automatically extracted from published literature on gene interactions. We matched these findings, filtered by science, with unfiltered gene interactions measured by the massive LINCS L1000 high-throughput experiment to identify and counteract sources of bias. Our calculus is built on easily extracted publication meta-data regarding the position of a scientific claim within the web of prior knowledge, and its breadth of support across institutions, authors and communities, revealing that scientifically focused but socially and institutionally independent research activity is most likely to replicate. These findings recommend policies that go against the common practice of channeling biomedical research funding into centralized research consortia and institutes rather than dispersing it more broadly. Our results demonstrate that robust scientific findings hinge upon a delicate balance of shared focus and independence, and that this complex pattern can be computationally exploited to decode bias and predict the replicability of published findings. These insights provide guidance for scientists navigating the research literature and for science funders seeking to improve it.
Submitted 23 August, 2020;
originally announced August 2020.
-
Centralized "big science" communities more likely generate non-replicable results
Authors:
Valentin Danchev,
Andrey Rzhetsky,
James A. Evans
Abstract:
Growing concern that most published results, including those widely agreed upon, may be false is rarely examined against rapidly expanding research production. Replications have only occurred on small scales due to prohibitive expense and limited professional incentive. We introduce a novel, high-throughput replication strategy aligning 51,292 published claims about drug-gene interactions with high-throughput experiments performed through the NIH LINCS L1000 program. We show (1) that unique claims replicate 19% more frequently than at random, while those widely agreed upon replicate 45% more frequently, manifesting collective correction mechanisms in science; but (2) centralized scientific communities perpetuate claims that are less likely to replicate even if widely agreed upon, demonstrating how centralized, overlapping collaborations weaken collective understanding. Decentralized research communities involve more independent teams and use more diverse methodologies, generating the most robust, replicable results. Our findings highlight the importance of science policies that foster decentralized collaboration to promote robust scientific advance.
Submitted 15 January, 2018;
originally announced January 2018.
-
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Authors:
Ji-Sung Kim,
Xin Gao,
Andrey Rzhetsky
Abstract:
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), and area under the curve for receiver operating characteristic plots (all $p < 10^{-6}$). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
Submitted 27 April, 2018; v1 submitted 5 July, 2017;
originally announced July 2017.
-
Tradition and Innovation in Scientists' Research Strategies
Authors:
Jacob G. Foster,
Andrey Rzhetsky,
James A. Evans
Abstract:
What factors affect a scientist's choice of research problem? Qualitative research in the history, philosophy, and sociology of science suggests that this choice is shaped by an "essential tension" between the professional demand for productivity and a conflicting drive toward risky innovation. We examine this tension empirically in the context of biomedical chemistry. We use complex networks to represent the evolving state of scientific knowledge, as expressed in publications. We then define research strategies relative to these networks. Scientists can introduce novel chemicals or chemical relationships--or delve deeper into known ones. They can consolidate existing knowledge clusters, or bridge distant ones. Analyzing such choices in aggregate, we find that the distribution of strategies remains remarkably stable, even as chemical knowledge grows dramatically. High-risk strategies, which explore new chemical relationships, are less prevalent in the literature, reflecting a growing focus on established knowledge at the expense of new opportunities. Research following a risky strategy is more likely to be ignored but also more likely to achieve high impact and recognition. While the outcome of a risky strategy has a higher expected reward than the outcome of a conservative strategy, the additional reward is insufficient to compensate for the additional risk. By studying the winners of 137 different prizes in biomedicine and chemistry, we show that the occasional "gamble" for extraordinary impact is the most plausible explanation for observed levels of risk-taking. Our empirical demonstration and unpacking of the "essential tension" suggests policy interventions that may foster more innovative research.
Submitted 27 February, 2013;
originally announced February 2013.