-
Semantic Decomposition Improves Learning of Large Language Models on EHR Data
Authors:
David A. Bloore,
Romane Gauriau,
Anna L. Decker,
Jacob Oppenheim
Abstract:
Electronic health records (EHR) are widely believed to hold a profusion of actionable insights, encrypted in an irregular, semi-structured format, amidst a loud noise background. To simplify learning patterns of health and disease, medical codes in EHR can be decomposed into semantic units connected by hierarchical graphs. Building on earlier synergy between Bidirectional Encoder Representations from Transformers (BERT) and Graph Attention Networks (GAT), we present H-BERT, which ingests complete graph tree expansions of hierarchical medical codes as opposed to only ingesting the leaves and pushes patient-level labels down to each visit. This methodology significantly improves prediction of patient membership in over 500 medical diagnosis classes as measured by aggregated AUC and APS, and creates distinct representations of patients in closely related but clinically distinct phenotypes.
Submitted 14 November, 2022;
originally announced December 2022.
-
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Authors:
Ruchir Puri,
David S. Kung,
Geert Janssen,
Wei Zhang,
Giacomo Domeniconi,
Vladimir Zolotov,
Julian Dolby,
Jie Chen,
Mihir Choudhury,
Lindsey Decker,
Veronika Thost,
Luca Buratti,
Saurabh Pujar,
Shyam Ramji,
Ulrich Finkler,
Susan Malaika,
Frederick Reiss
Abstract:
Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.
Submitted 29 August, 2021; v1 submitted 24 May, 2021;
originally announced May 2021.
-
Adaptive Multiplane Image Generation from a Single Internet Picture
Authors:
Diogo C. Luvizon,
Gustavo Sutter P. Carvalho,
Andreza A. dos Santos,
Jhonatas S. Conceicao,
Jose L. Flores-Campana,
Luis G. L. Decker,
Marcos R. Souza,
Helio Pedrini,
Antonio Joia,
Otavio A. B. Penatti
Abstract:
In the last few years, several works have tackled the problem of novel view synthesis from stereo images or even from a single picture. However, previous methods are computationally expensive, especially for high-resolution images. In this paper, we address the problem of generating a multiplane image (MPI) from a single high-resolution picture. We present the adaptive-MPI representation, which allows rendering novel views with low computational requirements. To this end, we propose an adaptive slicing algorithm that produces an MPI with a variable number of image planes. We present a new lightweight CNN for depth estimation, which is learned by knowledge distillation from a larger network. Occluded regions in the adaptive-MPI are also inpainted by a lightweight CNN. We show that our method is capable of producing high-quality predictions with one order of magnitude fewer parameters compared to previous approaches. The robustness of our method is evidenced on challenging pictures from the Internet.
Submitted 26 November, 2020;
originally announced November 2020.
-
Parallax Motion Effect Generation Through Instance Segmentation And Depth Estimation
Authors:
Allan Pinto,
Manuel A. Córdova,
Luis G. L. Decker,
Jose L. Flores-Campana,
Marcos R. Souza,
Andreza A. dos Santos,
Jhonatas S. Conceição,
Henrique F. Gagliardi,
Diogo C. Luvizon,
Ricardo da S. Torres,
Helio Pedrini
Abstract:
Stereo vision is a growing topic in computer vision due to the innumerable opportunities and applications this technology offers for the development of modern solutions, such as virtual and augmented reality applications. Motion parallax estimation is a promising technique for enhancing the user's experience in three-dimensional virtual environments. In this paper, we propose an algorithm for generating parallax motion effects from a single image, taking advantage of state-of-the-art instance segmentation and depth estimation approaches. This work also presents a comparison of such algorithms to investigate the trade-off between efficiency and quality of the parallax motion effects, taking into consideration a multi-task learning network capable of performing instance segmentation and depth estimation at once. Experimental results and visual quality assessment indicate that the PyD-Net network (depth estimation) combined with Mask R-CNN or FBNet networks (instance segmentation) can produce parallax motion effects with good visual quality.
Submitted 6 October, 2020;
originally announced October 2020.
-
Comparison of Evolving Granular Classifiers applied to Anomaly Detection for Predictive Maintenance in Computing Centers
Authors:
Leticia Decker,
Daniel Leite,
Fabio Viola,
Daniele Bonacorsi
Abstract:
Log-based predictive maintenance of computing centers is a main concern regarding the worldwide computing grid that supports the CERN (European Organization for Nuclear Research) physics experiments. A log, as event-oriented ad hoc information, is quite often given as unstructured big data. Log data processing is a time-consuming computational task. The goal is to extract essential information from a continuously changing grid environment to construct a classification model. Evolving granular classifiers are suited to learn from time-varying log streams and, therefore, perform online classification of the severity of anomalies. We formulated a 4-class online anomaly classification problem, and employed time windows between landmarks and two granular computing methods, namely, Fuzzy-set-Based evolving Modeling (FBeM) and evolving Granular Neural Network (eGNN), to model and monitor logging activity rate. The results of classification are of utmost importance for predictive maintenance because priority can be given to specific time intervals in which the classifier indicates the existence of high or medium severity anomalies.
Submitted 8 April, 2020;
originally announced May 2020.
-
Real-Time Anomaly Detection in Data Centers for Log-based Predictive Maintenance using an Evolving Fuzzy-Rule-Based Approach
Authors:
Leticia Decker,
Daniel Leite,
Luca Giommi,
Daniele Bonacorsi
Abstract:
Detection of anomalous behaviors in data centers is crucial to predictive maintenance and data safety. By data center, we mean any computer network that allows users to transmit and exchange data and information. In particular, we focus on the Tier-1 data center of the Italian Institute for Nuclear Physics (INFN), which supports the high-energy physics experiments at the Large Hadron Collider (LHC) in Geneva. The center provides resources and services needed for data processing, storage, analysis, and distribution. Log recording in the data center is a stochastic and non-stationary phenomenon. We propose a real-time approach to monitor and classify log records based on sliding time windows, and a time-varying evolving fuzzy-rule-based classification model. The most frequent log pattern according to a control chart is taken as the normal system status. We extract attributes from time windows to gradually develop and update an evolving Gaussian Fuzzy Classifier (eGFC) on the fly. The real-time anomaly monitoring system has shown encouraging results in terms of accuracy, compactness, and real-time operation.
Submitted 25 April, 2020;
originally announced April 2020.
-
EGFC: Evolving Gaussian Fuzzy Classifier from Never-Ending Semi-Supervised Data Streams -- With Application to Power Quality Disturbance Detection and Classification
Authors:
Daniel Leite,
Leticia Decker,
Marcio Santana,
Paulo Souza
Abstract:
Power-quality disturbances lead to several drawbacks such as limitation of the production capacity, increased line and equipment currents, and consequent ohmic losses; higher operating temperatures, premature faults, reduction of life expectancy of machines, malfunction of equipment, and unplanned outages. Real-time detection and classification of disturbances are deemed essential to industry standards. We propose an Evolving Gaussian Fuzzy Classification (EGFC) framework for semi-supervised disturbance detection and classification combined with a hybrid Hodrick-Prescott and Discrete-Fourier-Transform attribute-extraction method applied over a landmark window of voltage waveforms. Disturbances such as spikes, notching, harmonics, and oscillatory transients are considered. Unlike other monitoring systems, which require offline training of models based on a limited amount of data and occurrences, the proposed online data-stream-based EGFC method is able to learn disturbance patterns autonomously from never-ending data streams by adapting the parameters and structure of a fuzzy rule base on the fly. Moreover, the fuzzy model obtained is linguistically interpretable, which improves model acceptability. We show encouraging classification results.
Submitted 17 April, 2020;
originally announced April 2020.