-
Clustering Alzheimer's Disease Subtypes via Similarity Learning and Graph Diffusion
Authors:
Tianyi Wei,
Shu Yang,
Davoud Ataee Tarzanagh,
Jingxuan Bao,
Jia Xu,
Patryk Orzechowski,
Joost B. Wagenaar,
Qi Long,
Li Shen
Abstract:
Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Due to the heterogeneous nature of AD, its diagnosis and treatment pose critical challenges. Consequently, there is a growing research interest in identifying homogeneous AD subtypes that can assist in addressing these challenges in recent years. In this study, we aim to identify subtypes of…
▽ More
Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Due to the heterogeneous nature of AD, its diagnosis and treatment pose critical challenges. Consequently, there is a growing research interest in identifying homogeneous AD subtypes that can assist in addressing these challenges in recent years. In this study, we aim to identify subtypes of AD that represent distinctive clinical features and underlying pathology by utilizing unsupervised clustering with graph diffusion and similarity learning. We adopted SIMLR, a multi-kernel similarity learning framework, and graph diffusion to perform clustering on a group of 829 patients with AD and mild cognitive impairment (MCI, a prodromal stage of AD) based on their cortical thickness measurements extracted from magnetic resonance imaging (MRI) scans. Although the clustering approach we utilized has not been explored for the task of AD subtyping before, it demonstrated significantly better performance than several commonly used clustering methods. Specifically, we showed the power of graph diffusion in reducing the effects of noise in the subtype detection. Our results revealed five subtypes that differed remarkably in their biomarkers, cognitive status, and some other clinical features. To evaluate the resultant subtypes further, a genetic association study was carried out and successfully identified potential genetic underpinnings of different AD subtypes. Our source code is available at: https://github.com/PennShenLab/AD-SIMLR.
△ Less
Submitted 4 October, 2024;
originally announced October 2024.
-
Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers
Authors:
Patryk Orzechowski,
Jason H. Moore
Abstract:
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determine their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of…
▽ More
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determine their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions which map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms thus providing ideas for improvement. The resource with extensive documentation and analyses is open-source and available on GitHub.
△ Less
Submitted 13 July, 2021;
originally announced July 2021.
-
Benchmarking in Optimization: Best Practice and Open Issues
Authors:
Thomas Bartz-Beielstein,
Carola Doerr,
Daan van den Berg,
Jakob Bossek,
Sowmya Chandrasekaran,
Tome Eftimov,
Andreas Fischbach,
Pascal Kerschke,
William La Cava,
Manuel Lopez-Ibanez,
Katherine M. Malan,
Jason H. Moore,
Boris Naujoks,
Patryk Orzechowski,
Vanessa Volz,
Markus Wagner,
Thomas Weise
Abstract:
This survey compiles ideas and recommendations from more than a dozen researchers with different backgrounds and from different institutes around the world. Promoting best practice in benchmarking is its main goal. The article discusses eight essential topics in benchmarking: clearly stated goals, well-specified problems, suitable algorithms, adequate performance measures, thoughtful analysis, eff…
▽ More
This survey compiles ideas and recommendations from more than a dozen researchers with different backgrounds and from different institutes around the world. Promoting best practice in benchmarking is its main goal. The article discusses eight essential topics in benchmarking: clearly stated goals, well-specified problems, suitable algorithms, adequate performance measures, thoughtful analysis, effective and efficient designs, comprehensible presentations, and guaranteed reproducibility. The final goal is to provide well-accepted guidelines (rules) that might be useful for authors and reviewers. As benchmarking in optimization is an active and evolving field of research this manuscript is meant to co-evolve over time by means of periodic updates.
△ Less
Submitted 16 December, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
EBIC: an open source software for high-dimensional and big data biclustering analyses
Authors:
Patryk Orzechowski,
Jason H. Moore
Abstract:
Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding support for big data, making it possible to efficiently run large genomic data mining analyses. Additional enhancements include integration with R and Bioconductor and an option to remove influence of missing value on the…
▽ More
Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding support for big data, making it possible to efficiently run large genomic data mining analyses. Additional enhancements include integration with R and Bioconductor and an option to remove influence of missing value on the final result.
Results: EBIC was applied to datasets of different sizes, including a large DNA methylation dataset with 436,444 rows. For the largest dataset we observed over 6.6 fold speedup in computation time on a cluster of 8 GPUs compared to running the method on a single GPU. This proves high scalability of the algorithm.
Availability: The latest version of EBIC could be downloaded from http://github.com/EpistasisLab/ebic . Installation and usage instructions are also available online.
△ Less
Submitted 4 September, 2024; v1 submitted 25 July, 2018;
originally announced July 2018.
-
Considerations of automated machine learning in clinical metabolic profiling: Altered homocysteine plasma concentration associated with metformin exposure
Authors:
Alena Orlenko,
Jason H. Moore,
Patryk Orzechowski,
Randal S. Olson,
Junmei Cairns,
Pedro J. Caraballo,
Richard M. Weinshilboum,
Liewei Wang,
Matthew K. Breitenstein
Abstract:
With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must b…
▽ More
With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated. Particularly, regarding the robustness of AutoML to identify and adjust for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-Based Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML, using default parameters, demonstrated potential to lack sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT; an association not identified in parallel clinical metabolic profiling endeavors. While considerations are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
△ Less
Submitted 9 October, 2017;
originally announced October 2017.