-
Assessing Chronic Kidney Disease from Office Visit Records Using Hierarchical Meta-Classification of an Imbalanced Dataset
Authors:
M. Bhattacharya,
C. Jurkovitz,
H. Shatkay
Abstract:
Chronic Kidney Disease (CKD) is an increasingly prevalent condition affecting 13% of the US population. The disease is often a silent condition, making its diagnosis challenging. Identifying CKD stages from standard office visit records can help in early detection of the disease and lead to timely intervention. The dataset we use is highly imbalanced. We propose a hierarchical meta-classification method, aiming to stratify CKD by severity levels, employing simple quantitative non-text features gathered from office visit records, while addressing data imbalance. Our method effectively stratifies CKD severity levels, obtaining high average sensitivity, precision and F-measure (~93%). We also conduct experiments in which the dimensionality of the data is significantly reduced to include only the most salient features. Our results show that the good performance of our system is retained even when using the reduced feature sets, as well as under much reduced training sets, indicating that our method is stable and generalizable.
Submitted 17 November, 2017;
originally announced December 2017.
-
Protein (Multi-)Location Prediction: Using Location Inter-Dependencies in a Probabilistic Framework
Authors:
Ramanuja Simha,
Hagit Shatkay
Abstract:
Knowing the location of a protein within the cell is important for understanding its function, role in biological processes, and potential use as a drug target. Much progress has been made in developing computational methods that predict single locations for proteins, assuming that proteins localize to a single location. However, it has been shown that proteins localize to multiple locations. While a few recent systems have attempted to predict multiple locations of proteins, they typically treat locations as independent or capture inter-dependencies by treating each location-combination present in the training set as an individual location-class. We present a new method and a preliminary system we have developed that directly incorporates inter-dependencies among locations into the multiple-location-prediction process, using a collection of Bayesian network classifiers. We evaluate our system on a dataset of single- and multi-localized proteins. Our results, obtained by incorporating inter-dependencies, are significantly higher than those obtained by classifiers that do not use inter-dependencies. The performance of our system on multi-localized proteins is comparable to a top performing system (YLoc+), without restricting predictions to be based only on location-combinations present in the training set.
Submitted 29 July, 2013;
originally announced July 2013.
-
A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature
Authors:
Anália Lourenço,
Michael Conover,
Andrew Wong,
Azadeh Nematzadeh,
Fengxia Pan,
Hagit Shatkay,
Luis M. Rocha
Abstract:
We participated in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. For the IMT, we experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions. For the IMT, our results are comparable to those of other systems, which took very different approaches. For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as "rules" for human understanding of the classification. For the IMT, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment; the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods.
Submitted 22 April, 2011; v1 submitted 21 March, 2011;
originally announced March 2011.