-
A New Similarity Function for Spectral Clustering with Application to Plant Phenotypic Data
Authors:
Kapil Ahuja,
Mithun Singh,
Kuldeep Pathak,
Milind B. Ratnaparkhe
Abstract:
Clustering species of the same plant into different groups is an important step in developing new species of the concerned plant. Phenotypic (or physical) characteristics of plant species are commonly used to perform clustering. Hierarchical Clustering (HC) is popularly used for this task, but this algorithm suffers from low accuracy. In a recent work (Shastri et al., 2021), the authors used the standard Spectral Clustering (SC) algorithm to improve the clustering accuracy and demonstrated the efficacy of their algorithm on soybean species.
In the SC algorithm, one of the crucial steps is building the similarity matrix. A Gaussian similarity function is the standard choice for building this matrix. Many past works have proposed variants of the Gaussian similarity function to improve the performance of the SC algorithm; however, all have focused on the variance or scaling of the Gaussian. None has investigated the choice of base "e" (Euler's number) in the Gaussian similarity function (a natural exponential function).
Based upon spectral graph theory, specifically Cheeger's inequality, in this work we propose the use of a base "a" exponential function as the similarity function. We also integrate this new approach with the notion of "local scaling" from one of the first works that experimented with the scaling of the Gaussian similarity function (Zelnik-Manor et al., 2004).
Using an eigenvalue analysis, we theoretically justify that our proposed algorithm should work better than the existing one. With evaluation on 2376 soybean species and 1865 rice species, we experimentally demonstrate that our new SC is 35% and 11% better than the standard SC, respectively.
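As a rough illustration of the idea in this abstract, the sketch below builds a similarity matrix with a configurable exponential base and local scaling. The exact formula is not given in the abstract, so the form w_ij = a^(-d_ij^2 / (sigma_i * sigma_j)), with sigma_i taken as the distance to the k-th nearest neighbour, is an assumption modelled on the usual local-scaling construction. The function names, the Euclidean distance, and the default k are hypothetical.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

def local_scales(points, k):
    """sigma_i = distance from point i to its k-th nearest neighbour."""
    scales = []
    for i, x in enumerate(points):
        dists = sorted(
            math.sqrt(sq_dist(x, y)) for j, y in enumerate(points) if j != i
        )
        scales.append(dists[k - 1])
    return scales

def similarity_matrix(points, base=math.e, k=2):
    """Assumed form: w_ij = base ** (-d_ij^2 / (sigma_i * sigma_j)).

    base=math.e gives the locally scaled Gaussian similarity;
    any other base > 1 gives a base-"a" exponential variant.
    The diagonal is left at zero, as is common in spectral clustering.
    Assumes no duplicate points (a zero sigma would divide by zero).
    """
    sigma = local_scales(points, k)
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            exponent = -sq_dist(points[i], points[j]) / (sigma[i] * sigma[j])
            W[i][j] = W[j][i] = base ** exponent
    return W

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
W_e = similarity_matrix(points, base=math.e, k=2)
W_10 = similarity_matrix(points, base=10.0, k=2)
```

With a larger base, every off-diagonal similarity shrinks, so distant points are separated more sharply. Which base works best for a given dataset is the question the abstract addresses; this sketch does not attempt to answer it.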
Submitted 23 May, 2025; v1 submitted 22 December, 2023;
originally announced December 2023.
-
A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis
Authors:
Preeti Jha,
Aruna Tiwari,
Neha Bharill,
Milind Ratnaparkhe,
Om Prakash Patel,
Nilagiri Harshith,
Mukkamalla Mounika,
Neha Nagendra
Abstract:
Genome sequencing projects are rapidly increasing the number of high-dimensional protein sequence datasets. Clustering a high-dimensional protein sequence dataset using traditional machine learning approaches poses many challenges. Many different feature extraction methods exist and are widely used. However, extracting features from millions of protein sequences becomes impractical because current algorithms do not scale. Therefore, there is a need for an efficient feature extraction approach that extracts significant features. We propose two scalable feature extraction approaches for extracting features from huge protein sequences using Apache Spark, termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are used as input to two clustering algorithms, i.e., Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (SRSIO-FCM) and Scalable Literal Fuzzy C-Means (SLFCM). We have conducted extensive experiments on various soybean protein datasets to compare the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, with existing feature extraction methods on the SRSIO-FCM and SLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies-Bouldin index show that the proposed 60d-SPF extraction method, used with the SRSIO-FCM and SLFCM clustering algorithms, achieves significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.
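To make the notion of a fixed-length numeric feature vector concrete, here is a minimal sketch that maps a protein sequence of any length to a 20-dimensional amino acid composition vector. This is not 60d-SPF or 6d-SCPSF, whose definitions are not given in the abstract. It is a simpler, well-known descriptor used purely for illustration, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_features(sequence):
    """Return the relative frequency of each standard amino acid.

    Characters outside the 20 standard residues are ignored, so the
    output always has length 20 regardless of the sequence length.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

short_vec = composition_features("ACCA")
long_vec = composition_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Because the function treats each sequence independently, a step of this kind could in principle be mapped over a distributed collection of sequences, which is the sort of parallelism the abstract attributes to Apache Spark.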
Submitted 21 April, 2022;
originally announced April 2022.
-
Probabilistically Sampled and Spectrally Clustered Plant Genotypes using Phenotypic Characteristics
Authors:
Aditya A. Shastri,
Kapil Ahuja,
Milind B. Ratnaparkhe,
Yann Busnel
Abstract:
Clustering genotypes based upon their phenotypic characteristics is used to obtain diverse sets of parents that are useful in breeding programs. The Hierarchical Clustering (HC) algorithm is the current standard for clustering phenotypic data. This algorithm suffers from low accuracy and high computational complexity. To address the accuracy challenge, we propose the use of the Spectral Clustering (SC) algorithm. To make the algorithm computationally cheap, we propose using sampling, specifically Pivotal Sampling, which is probability based. Since the application of sampling to phenotypic data has not been explored much, for effective comparison, another sampling technique called Vector Quantization (VQ) is adapted for this data as well. VQ has recently given promising results for genome data.
The novelty of our SC with Pivotal Sampling algorithm is in constructing the crucial similarity matrix for the clustering algorithm and defining probabilities for the sampling technique. Although our algorithm can be applied to any plant genotypes, we test it on the phenotypic data obtained from about 2400 Soybean genotypes. SC with Pivotal Sampling achieves substantially higher accuracy (in terms of Silhouette Values) than all the other proposed competitive clustering-with-sampling algorithms (i.e., SC with VQ, HC with Pivotal Sampling, and HC with VQ). The complexities of our SC with Pivotal Sampling algorithm and these three variants are almost the same because of the sampling involved. In addition, SC with Pivotal Sampling outperforms the standard HC algorithm in both accuracy and computational complexity. We experimentally show that we are up to 45% more accurate than HC in terms of clustering accuracy. The computational complexity of our algorithm is more than an order of magnitude lower than that of HC.
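The abstract measures accuracy in terms of Silhouette Values. A minimal sketch of that standard metric follows, assuming Euclidean distance and at least two clusters. The function name and the toy data are hypothetical, and the abstract does not state which distance the authors used.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b).

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Points in singleton clusters get 0 by convention.
    Assumes at least two clusters and no cluster of identical points.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            values.append(0.0)
            continue
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_values(points, [0, 1, 0, 1])   # clusters that cut across the gap
```

Values near 1 indicate that a point sits well inside its cluster, while negative values suggest it would fit better elsewhere. The mean over all points is a common single-number summary of a clustering.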
Submitted 18 September, 2020;
originally announced September 2020.
-
Vector Quantized Spectral Clustering applied to Soybean Whole Genome Sequences
Authors:
Aditya A. Shastri,
Kapil Ahuja,
Milind B. Ratnaparkhe,
Aditya Shah,
Aishwary Gagrani,
Anant Lal
Abstract:
We develop a Vector Quantized Spectral Clustering (VQSC) algorithm that combines Spectral Clustering (SC) and Vector Quantization (VQ) sampling for grouping Soybean genomes. The motivation is to use SC for its accuracy and VQ to make the algorithm computationally cheap (the complexity of SC is cubic in terms of the input size). Although the combination of SC and VQ is not new, the novelty of our work is in developing the crucial similarity matrix in SC as well as the use of k-medoids in VQ, both adapted for the Soybean genome data. We compare our approach with commonly used techniques like UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbour Joining). Experimental results show that our approach significantly outperforms both techniques in terms of cluster quality (up to 25% better cluster quality) and time complexity (an order of magnitude faster).
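The role of k-medoids in vector quantization can be sketched as follows: a small set of representative data points (medoids) is chosen, and the expensive clustering step can then run on those representatives instead of the full dataset. The code below is a generic alternating k-medoids on Euclidean points, not the genome-adapted variant described in the abstract. The function names and the fixed initial medoids are hypothetical.

```python
import math

def assign(points, medoids):
    """Map each medoid index to the indices of the points nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing.

    Assumes distinct points and distinct initial medoid indices, so that
    no cluster ever becomes empty.
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        new_medoids = [
            min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, initial_medoids=[0, 1])
```

Unlike k-means centroids, medoids are always actual data points, which matters when the data (such as genome sequences) has no meaningful average.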
Submitted 30 September, 2018;
originally announced October 2018.