-
A multi-core periphery perspective: Ranking via relative centrality
Authors:
Chandra Sekhar Mukherjee,
Jiapeng Zhang
Abstract:
Community and core-periphery are two widely studied graph structures, with their coexistence observed in real-world graphs (Rombach, Porter, Fowler & Mucha [SIAM J. App. Math. 2014, SIAM Review 2017]). However, the nature of this coexistence is not well understood and has been pointed out as an open problem (Yanchenko & Sengupta [Statistics Surveys, 2023]). In particular, little use has been made of how inferring the core-periphery structure of a graph can inform the understanding of its community structure. In this direction, we introduce a novel quantification for graphs with ground truth communities, where each community has a densely connected part (the core), and the rest is more sparse (the periphery), with inter-community edges more frequent between the peripheries.
Building on this structure, we propose a new algorithmic concept that we call relative centrality to detect the cores. We observe that core-detection algorithms based on popular centrality measures such as PageRank and degree centrality can be biased, selecting very few vertices from some cores. We show that relative centrality solves this bias issue and provide theoretical and simulation support, as well as experiments on real-world graphs.
Core detection is known to have important applications with respect to core-periphery structures. In our model, we show a new application: relative-centrality-based algorithms can select a subset of the vertices such that it contains sufficient vertices from all communities, and points in this subset are better separable into their respective communities. We apply the methods to 11 biological datasets, with our methods resulting in a more balanced selection of vertices from all communities such that clustering algorithms have better performance on this set.
Submitted 6 June, 2024;
originally announced June 2024.
-
Detecting Hidden Communities by Power Iterations with Connections to Vanilla Spectral Algorithms
Authors:
Chandra Sekhar Mukherjee,
Jiapeng Zhang
Abstract:
Community detection in the stochastic block model is one of the central problems of graph clustering. Since its introduction, many subsequent papers have made great strides in solving and understanding this model. In this setup, spectral algorithms have been one of the most widely used frameworks. However, despite the long history of study, there are still unsolved challenges. One of the main open problems is the design and analysis of "simple" (vanilla) spectral algorithms, especially when the number of communities is large.
In this paper, we provide two algorithms. The first one is based on the power-iteration method. It is a simple algorithm which only compares the rows of the powered adjacency matrix. Our algorithm performs optimally (up to logarithmic factors) compared to the best known bounds in the dense graph regime by Van Vu (Combinatorics Probability and Computing, 2018). Furthermore, our algorithm is also robust to the "small cluster barrier", recovering large clusters in the presence of an arbitrary number of small clusters. Then, based on a connection between the powered adjacency matrix and eigenvectors, we provide a vanilla spectral algorithm for a large number of communities in the balanced case. This answers an open question by Van Vu (Combinatorics Probability and Computing, 2018) in the balanced case. Our methods also partially solve technical barriers discussed by Abbe, Fan, Wang and Zhong (Annals of Statistics, 2020).
On the technical side, we introduce a random partition method to analyze each entry of a powered random matrix. This method can be viewed as an eigenvector version of Wigner's trace method. Recall that Wigner's trace method links the trace of a powered matrix to its eigenvalues. Our method links the whole powered matrix to the span of eigenvectors. We expect our method to have more applications in random matrix theory.
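To illustrate the row-comparison idea mentioned in this abstract, here is a minimal sketch. It is not the authors' algorithm: it samples a small two-block stochastic block model, raises the adjacency matrix to a power, and greedily groups vertices whose normalized rows are close. The function names, the power, the distance threshold, and the SBM parameters are all assumptions chosen for the example.

```python
import numpy as np


def sample_sbm(sizes, p, q, rng):
    """Sample a symmetric SBM adjacency matrix (hypothetical helper).

    Vertices in the same block are joined with probability p,
    vertices in different blocks with probability q.
    """
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p, q)
    # Draw each unordered pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(float)
    return adj, labels


def cluster_by_powered_rows(adj, power, threshold):
    """Greedily group vertices whose rows of adj**power are close.

    This is an illustrative heuristic, assumed for this sketch only.
    """
    powered = np.linalg.matrix_power(adj, power)
    # Normalize rows so that degree fluctuations matter less.
    rows = powered / np.linalg.norm(powered, axis=1, keepdims=True)
    n = adj.shape[0]
    assigned = -np.ones(n, dtype=int)
    next_label = 0
    for v in range(n):
        if assigned[v] >= 0:
            continue
        # Every still-unassigned vertex whose row is near row v joins v's group.
        close = np.linalg.norm(rows - rows[v], axis=1) < threshold
        assigned[close & (assigned < 0)] = next_label
        next_label += 1
    return assigned
```

A call such as `cluster_by_powered_rows(adj, power=2, threshold=0.5)` on a dense, well-separated instance is expected to return one label per block; the threshold would need tuning for sparser or less separated graphs.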
Submitted 7 November, 2022;
originally announced November 2022.
-
Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis
Authors:
Yingcong Li,
Chandra Sekhar Mukherjee,
Jiapeng Zhang
Abstract:
Unsupervised clustering algorithms for vectors have been widely used in machine learning. Many applications, including the biological data we study in this paper, contain some boundary datapoints which combine properties of two underlying clusters and can lower the performance of traditional clustering algorithms. We develop a confident clustering method aiming to diminish the influence of these datapoints and improve the clustering results. Concretely, for a list of datapoints, we give two clustering results. The first-round clustering attempts to classify only pure vectors with high confidence. Based on it, we classify more vectors with less confidence in the second round. We validate our algorithm on single-cell RNA-seq data, a powerful and widely used tool in biology. Our confident clustering shows high accuracy on our tested datasets. In addition, unlike traditional clustering methods in single-cell analysis, the confident clustering shows high stability under different choices of parameters.
Submitted 19 May, 2022;
originally announced May 2022.
-
Capturing the Denoising Effect of PCA via Compression Ratio
Authors:
Chandra Sekhar Mukherjee,
Nikhil Doerkar,
Jiapeng Zhang
Abstract:
Principal component analysis (PCA) is one of the most fundamental tools in machine learning with broad use as a dimensionality reduction and denoising tool. In the latter setting, while PCA is known to be effective at subspace recovery and is proven to aid clustering algorithms in some specific settings, its improvement of noisy data is still not well quantified in general.
In this paper, we propose a novel metric called \emph{compression ratio} to capture the effect of PCA on high-dimensional noisy data. We show that, for data with \emph{underlying community structure}, PCA significantly reduces the distance of data points belonging to the same community while reducing inter-community distance relatively mildly. We explain this phenomenon through both theoretical proofs and experiments on real-world data.
Building on this new metric, we design a straightforward algorithm that could be used to detect outliers. Roughly speaking, we argue that points that have a \emph{lower variance of compression ratio} do not share a \emph{common signal} with others (hence could be considered outliers).
We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. Our method is very competitive with popular outlier detection tools in this task.
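The effect described in this abstract can be sketched numerically. The code below is an illustration under stated assumptions, not the paper's definition or algorithm: it assumes the compression ratio of a pair of points is their distance before PCA divided by their distance after projecting onto the top `k` principal components. The function names and this exact formula are hypothetical choices for the example.

```python
import numpy as np


def pca_project(data, k):
    """Project centered data onto its top-k principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=-1)


def compression_ratios(data, k):
    """Assumed definition: pre-PCA distance divided by post-PCA distance.

    Entries where the post-PCA distance is zero (the diagonal) are NaN.
    """
    before = pairwise_distances(data)
    after = pairwise_distances(pca_project(data, k))
    ratios = np.full(before.shape, np.nan)
    mask = after > 0
    ratios[mask] = before[mask] / after[mask]
    return ratios
```

On synthetic data with two well-separated Gaussian communities, same-community pairs would be expected to show a much larger ratio than cross-community pairs, which matches the qualitative claim that PCA shrinks intra-community distances more strongly.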
Submitted 21 April, 2024; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle
Authors:
Chandra Sekhar Mukherjee,
Pan Peng,
Jiapeng Zhang
Abstract:
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption that there is a large interval into which no cluster size falls, and also by removing the dependency of the size of the recoverable clusters on the number of underlying clusters. We further complement our theoretical improvements with experimental comparisons. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to poly-logarithmic factors) when the probability parameters are constant.
As a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model, which is capable of detecting all clusters larger than $\tilde{\Omega}(\sqrt{n})$, even in the presence of $\Omega(n)$ small clusters in the graph. In contrast, previous efficient algorithms that use a sublinear number of queries are incapable of recovering any large clusters if there are more than $\tilde{\Omega}(n^{2/5})$ small clusters.
Submitted 21 October, 2023; v1 submitted 17 February, 2022;
originally announced February 2022.
-
On Boolean Functions with Low Polynomial Degree and Higher Order Sensitivity
Authors:
Subhamoy Maitra,
Chandra Sekhar Mukherjee,
Pantelimon Stanica,
Deng Tang
Abstract:
Boolean functions are important primitives in different domains of cryptology, complexity and coding theory. In this paper, we connect the tools from cryptology and complexity theory in the domain of Boolean functions with low polynomial degree and high sensitivity. It is well known that the polynomial degree of a Boolean function and its resiliency are directly connected. Using this connection we analyze the polynomial degree-sensitivity values through the lens of resiliency, demonstrating existence and non-existence results of functions with low polynomial degree and high sensitivity on a small number of variables (up to 10). In this process, borrowing an idea from complexity theory, we show that one can implement resilient Boolean functions on a large number of variables with linear size and logarithmic depth. Finally, we extend the notion of sensitivity to higher order and note that the existing construction idea of Nisan and Szegedy (1994) can provide only constant higher order sensitivity when aiming for polynomial degree of $n-\omega(1)$. In this direction, we present a construction with low ($n-\omega(1)$) polynomial degree and super-constant $\omega(1)$ order sensitivity exploiting Maiorana-McFarland constructions, which we borrow from the construction of resilient functions. The questions we raise identify novel combinatorial problems in the domain of Boolean functions.
Submitted 23 July, 2021;
originally announced July 2021.
-
Following Forrelation -- Quantum Algorithms in Exploring Boolean Functions' Spectra
Authors:
Suman Dutta,
Subhamoy Maitra,
Chandra Sekhar Mukherjee
Abstract:
Here we revisit the quantum algorithms for obtaining Forrelation [Aaronson et al, 2015] values to evaluate some of the well-known cryptographically significant spectra of Boolean functions, namely the Walsh spectrum, the cross-correlation spectrum and the autocorrelation spectrum. We introduce the existing 2-fold Forrelation formulation with bent duality based promise problems as desirable instantiations. Next we concentrate on the $3$-fold version through two approaches. First, we judiciously set up some of the functions in $3$-fold Forrelation, so that given oracle access, one can sample from the Walsh spectrum of $f$. Using this, we obtain better results than those obtained from the Deutsch-Jozsa algorithm, which in turn has implications for resiliency checking. Furthermore, we use a similar idea to obtain a technique for estimating the cross-correlation (and thus autocorrelation) value at any point, improving upon the existing algorithms. Finally, we tweak the quantum algorithm with a superposition of linear functions to obtain a cross-correlation sampling technique. To the best of our knowledge, this is the first cross-correlation sampling algorithm with constant query complexity. This also provides a strategy to check if two functions are uncorrelated of degree $m$. We further modify this using Dicke states so that the time complexity reduces, particularly for constant values of $m$.
Submitted 28 September, 2021; v1 submitted 25 April, 2021;
originally announced April 2021.
-
Exact Quantum Query Algorithms Outperforming Parity -- Beyond The Symmetric functions
Authors:
Chandra Sekhar Mukherjee,
Subhamoy Maitra
Abstract:
In the exact quantum query model, almost all of the Boolean functions for which non-trivial query algorithms exist are symmetric in nature. The best-known techniques in this domain exploit parity decision trees, in which the parity of two bits can be obtained by a single query. Thus, exact quantum query algorithms outperforming parity decision trees are rare. In this paper we first obtain optimal exact quantum query algorithms ($Q_{algo}(f)$) for a direct sum based class of $\Omega\left( 2^{\frac{\sqrt{n}}{2}} \right)$ non-symmetric functions. We construct these algorithms by analyzing the algebraic normal form together with a novel untangling strategy. Next we obtain the generalized parity decision tree complexity ($D_{\oplus}(f)$) by analyzing the Walsh spectrum. Finally, we show that the query complexity of $Q_{algo}$ is $\lceil \frac{3n}{4} \rceil$ whereas $D_{\oplus}(f)$ varies between $n-1$ and $\lceil \frac{3n}{4} \rceil+1$ for different classes, underlining linear separation between the two measures in many cases. To the best of our knowledge, this is the first family of algorithms beyond generalized parity (and thus parity) for a large class of non-symmetric functions. We also implement these techniques for a larger (doubly exponential in $\frac{n}{4}$) class of Maiorana-McFarland type functions, but could only obtain partial results using similar algorithmic techniques.
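The statement that the parity of two bits can be obtained by a single query is the standard Deutsch-style trick. The following is a small state-vector simulation of that textbook construction, included only as an illustration; it is not taken from the paper, and the function name and the phase-oracle representation are assumptions of this sketch.

```python
import numpy as np

# Single-qubit Hadamard gate.
HADAMARD = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)


def parity_with_one_query(x0, x1):
    """Return x0 XOR x1 using one phase query to the input bits.

    The single qubit indexes which input bit is queried.
    """
    # Prepare the uniform superposition over the two indices 0 and 1.
    state = HADAMARD @ np.array([1.0, 0.0])
    # One phase query: index i picks up the sign (-1)**x_i.
    oracle = np.diag([(-1.0) ** x0, (-1.0) ** x1])
    state = oracle @ state
    # A final Hadamard maps equal signs to |0> and opposite signs to |1>.
    state = HADAMARD @ state
    probabilities = state ** 2
    return int(np.argmax(probabilities))
```

The measurement outcome is deterministic here: the final state is, up to a global sign, exactly the basis state labeled by the parity, which is why the query counts as exact rather than bounded-error.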
Submitted 16 May, 2021; v1 submitted 14 August, 2020;
originally announced August 2020.
-
On Actual Preparation of Dicke State on a Quantum Computer
Authors:
Chandra Sekhar Mukherjee,
Subhamoy Maitra,
Vineet Gaurav,
Dibyendu Roy
Abstract:
The exact number of CNOT and single qubit gates needed to implement a Quantum Algorithm in a given architecture is one of the central problems of Quantum Computation. In this work we study the importance of concise realizations of Partially defined Unitary Transformations for better circuit construction using the case study of Dicke State Preparation. The Dicke States $(\left|D^n_k \right>)$ are an important class of entangled states with uses in many branches of Quantum Information. In this regard we provide the most efficient Deterministic Dicke State Preparation Circuit in terms of CNOT and single qubit gate counts in comparison to existing literature. We further observe that our improvements also reduce architectural constraints of the circuits. We implement the circuit for preparing $\left| D^4_2 \right>$ on the "ibmqx2" machine of the IBM QX service and observe that the error induced by noise in the system is lower than for the existing circuit descriptions. We conclude by describing the CNOT map of the generic $\left| D^n_k \right>$ preparation circuit and analyzing different ways of distributing the CNOT gates in the circuit and their effect on the induced error.
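For readers unfamiliar with the notation, the standard definition of a Dicke state is recalled below. It is the usual textbook definition rather than something quoted from the abstract; $\mathrm{wt}(x)$ denotes the Hamming weight of the bit string $x$.

```latex
\left| D^n_k \right> \;=\; \binom{n}{k}^{-1/2} \sum_{\substack{x \in \{0,1\}^n \\ \mathrm{wt}(x) = k}} \left| x \right>,
\qquad
\left| D^4_2 \right> \;=\; \frac{1}{\sqrt{6}} \Big( \left|0011\right> + \left|0101\right> + \left|0110\right> + \left|1001\right> + \left|1010\right> + \left|1100\right> \Big).
```

The second expression is the instance prepared on hardware in the abstract: an equal superposition of the $\binom{4}{2} = 6$ four-bit strings with exactly two ones.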
Submitted 19 July, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.
-
Classical-Quantum Separations in Certain Classes of Boolean Functions -- Analysis using the Parity Decision Trees
Authors:
Chandra Sekhar Mukherjee,
Subhamoy Maitra
Abstract:
In this paper we study the separation between the deterministic (classical) query complexity ($D$) and the exact quantum query complexity ($Q_E$) of several Boolean function classes using the parity decision tree method. We first define the Query Friendly (QF) functions on $n$ variables as the ones with minimum deterministic query complexity $(D(f))$. We observe that for each $n$, there exists a non-separable class of QF functions such that $D(f)=Q_E(f)$. Further, we show that for some values of $n$, all the QF functions are non-separable. Then we present QF functions for certain other values of $n$ where separation can be demonstrated, in particular, $Q_E(f)=D(f)-1$. In a related effort, we also study the Maiorana-McFarland (M-M) type Bent functions. We show that while for any M-M Bent function $f$ on $n$ variables $D(f) = n$, separation can be achieved as $\frac{n}{2} \leq Q_E(f) \leq \lceil \frac{3n}{4} \rceil$. Our results highlight how different classes of Boolean functions can be analyzed for classical-quantum separation exploiting the parity decision tree method.
Submitted 4 September, 2020; v1 submitted 27 April, 2020;
originally announced April 2020.