-
Probabilistic Robust Autoencoders for Outlier Detection
Authors:
Ofir Lindenbaum,
Yariv Aizenbud,
Yuval Kluger
Abstract:
Anomalies (or outliers) are prevalent in real-world empirical observations and potentially mask important underlying structures. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To automatically identify anomalies, we propose Probabilistic Robust AutoEncoder (PRAE). PRAE aims to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first present the Robust AutoEncoder (RAE) objective as a minimization problem for splitting the data into inliers and outliers. Our objective is designed to exclude outliers while including a subset of samples (inliers) that can be effectively reconstructed using an AutoEncoder (AE). RAE minimizes the autoencoder's reconstruction error while incorporating as many samples as possible. This could be formulated via regularization by subtracting an $\ell_0$ norm counting the number of selected samples from the reconstruction term. Unfortunately, this leads to an intractable combinatorial problem. Therefore, we propose two probabilistic relaxations of RAE, which are differentiable and alleviate the need for a combinatorial search. We prove that the solution to the PRAE problem is equivalent to the solution of RAE. We use synthetic data to show that PRAE can accurately remove outliers in a wide range of contamination levels. Finally, we demonstrate that using PRAE for anomaly detection leads to state-of-the-art results on various benchmark datasets.
Submitted 24 August, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
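As a rough illustration of the RAE objective described in the abstract above, one way to write "reconstruction error minus an $\ell_0$ count of selected samples" is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: $z \in \{0,1\}^n$ marks the selected samples (inliers), $f_\theta$ is the autoencoder, and $\lambda > 0$ is a regularization weight.

$$
\min_{\theta,\; z \in \{0,1\}^n} \;\; \sum_{i=1}^{n} z_i \,\bigl\| x_i - f_\theta(x_i) \bigr\|_2^2 \;-\; \lambda \,\| z \|_0
$$

Under this reading, setting $z_i = 1$ pays the reconstruction cost of sample $i$ and earns a reward of $\lambda$, so only samples the autoencoder reconstructs well are worth including. The search over binary $z$ is the combinatorial part that the abstract says the probabilistic relaxations replace.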
-
Non-Parametric Estimation of Manifolds from Noisy Data
Authors:
Yariv Aizenbud,
Barak Sober
Abstract:
A common observation in data-driven applications is that high dimensional data has a low intrinsic dimension, at least locally. In this work, we consider the problem of estimating a $d$ dimensional sub-manifold of $\mathbb{R}^D$ from a finite set of noisy samples. Assuming that the data was sampled uniformly from a tubular neighborhood of $\mathcal{M}\in \mathcal{C}^k$, a compact manifold without boundary, we present an algorithm that takes a point $r$ from the tubular neighborhood and outputs $\hat p_n\in \mathbb{R}^D$, and $\widehat{T_{\hat p_n}\mathcal{M}}$ an element in the Grassmannian $Gr(d, D)$. We prove that as the number of samples $n\to\infty$ the point $\hat p_n$ converges to $p\in \mathcal{M}$ and $\widehat{T_{\hat p_n}\mathcal{M}}$ converges to $T_p\mathcal{M}$ (the tangent space at that point) with high probability. Furthermore, we show that the estimation yields asymptotic rates of convergence of $n^{-\frac{k}{2k + d}}$ for the point estimation and $n^{-\frac{k-1}{2k + d}}$ for the estimation of the tangent space. These rates are known to be optimal for the case of function estimation.
Submitted 19 July, 2021; v1 submitted 10 May, 2021;
originally announced May 2021.
-
Spectral Top-Down Recovery of Latent Tree Models
Authors:
Yariv Aizenbud,
Ariel Jaffe,
Meng Wang,
Amber Hu,
Noah Amsel,
Boaz Nadler,
Joseph T. Chang,
Yuval Kluger
Abstract:
Modeling the distribution of high dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, separately recover the structure of multiple, possibly random, subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non-random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
Submitted 7 December, 2021; v1 submitted 25 February, 2021;
originally announced February 2021.
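The Fiedler-vector partitioning mentioned in the STDR abstract above can be illustrated with a minimal sketch. It assumes the unnormalized graph Laplacian $L = D - W$ of a symmetric similarity matrix; the abstract only says "a suitable Laplacian matrix", so this choice, the function name `fiedler_partition`, and the toy similarity matrix are all assumptions made for illustration and not the authors' implementation.

```python
import numpy as np


def fiedler_partition(similarity):
    """Split nodes into two groups by the sign of the Fiedler vector.

    Assumption: uses the unnormalized Laplacian L = D - W, which may
    differ from the Laplacian used in the STDR paper.
    """
    w = np.asarray(similarity, dtype=float)
    w = (w + w.T) / 2.0          # enforce symmetry
    np.fill_diagonal(w, 0.0)     # ignore self-similarity
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order for symmetric matrices;
    # column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0


# Toy example (hypothetical data): two groups of three nodes with strong
# within-group similarity and weak cross-group similarity.
similarity = np.full((6, 6), 0.05)
similarity[:3, :3] = 0.9
similarity[3:, 3:] = 0.9
labels = fiedler_partition(similarity)
print(labels)
```

The sign of an eigenvector is arbitrary, so which group is labeled `True` can vary; only the split itself is meaningful.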
-
Spectral neighbor joining for reconstruction of latent tree models
Authors:
Ariel Jaffe,
Noah Amsel,
Yariv Aizenbud,
Boaz Nadler,
Joseph T. Chang,
Yuval Kluger
Abstract:
A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop Spectral Neighbor Joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.
Submitted 22 September, 2020; v1 submitted 28 February, 2020;
originally announced February 2020.
-
Rank-one Multi-Reference Factor Analysis
Authors:
Yariv Aizenbud,
Boris Landa,
Yoel Shkolnisky
Abstract:
In recent years, there has been a growing need for processing methods aimed at extracting useful information from large datasets. In many cases the challenge is to discover a low-dimensional structure in the data, often concealed by the existence of nuisance parameters and noise. Motivated by such challenges, we consider the problem of estimating a signal from its scaled, cyclically-shifted and noisy observations. We focus on the particularly challenging regime of low signal-to-noise ratio (SNR), where different observations cannot be shift-aligned. We show that an accurate estimation of the signal from its noisy observations is possible, and derive a procedure which is proved to consistently estimate the signal. The asymptotic sample complexity (the number of observations required to recover the signal) of the procedure is $1/\operatorname{SNR}^4$. Additionally, we propose a procedure which is experimentally shown to improve the sample complexity by a factor equal to the signal's length. Finally, we present numerical experiments which demonstrate the performance of our algorithms, and corroborate our theoretical findings.
Submitted 4 June, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
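The observation model in the abstract above (scaled, cyclically shifted, noisy copies of one signal) can be simulated with a short sketch. The Gaussian scale factors, uniform shifts, and the helper name `observe` are assumptions made here for illustration; the abstract does not specify these distributions. The check at the end uses a standard Fourier fact (a cyclic shift changes only the phase of the DFT), and is not the estimation procedure of the paper.

```python
import numpy as np


def observe(signal, n_obs, noise_std, rng):
    """Generate y_i = a_i * shift(signal, s_i) + noise (hypothetical model).

    Assumptions: a_i ~ N(0, 1), s_i uniform over all cyclic shifts,
    and i.i.d. Gaussian noise with standard deviation noise_std.
    """
    length = signal.shape[0]
    scales = rng.normal(size=n_obs)
    shifts = rng.integers(0, length, size=n_obs)
    clean = np.stack([a * np.roll(signal, s) for a, s in zip(scales, shifts)])
    noisy = clean + noise_std * rng.normal(size=clean.shape)
    return noisy, scales, shifts


rng = np.random.default_rng(0)
signal = rng.normal(size=16)

# With the noise switched off, the power spectrum of each observation
# equals the signal's power spectrum times the squared scale factor,
# regardless of the unknown shift.
obs, scales, shifts = observe(signal, n_obs=5, noise_std=0.0, rng=rng)
power = np.abs(np.fft.fft(obs, axis=1)) ** 2
reference = np.abs(np.fft.fft(signal)) ** 2
print(np.allclose(power, scales[:, None] ** 2 * reference))
```

Shift-invariant quantities of this kind are one reason estimation can remain possible when observations cannot be aligned, although the power spectrum alone discards the phase information needed to recover the signal itself.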
-
Approximation of Functions over Manifolds: A Moving Least-Squares Approach
Authors:
Barak Sober,
Yariv Aizenbud,
David Levin
Abstract:
We present an algorithm for approximating a function defined over a $d$-dimensional manifold utilizing only noisy function values at locations sampled from the manifold with noise. To produce the approximation we do not require any knowledge regarding the manifold other than its dimension $d$. We use the Manifold Moving Least-Squares approach of (Sober and Levin 2016) to reconstruct the atlas of charts and the approximation is built on top of those charts. The resulting approximant is shown to be a function defined over a neighborhood of a manifold, approximating the originally sampled manifold. In other words, given a new point, located near the manifold, the approximation can be evaluated directly on that point. We prove that our construction yields a smooth function, and in case of noiseless samples the approximation order is $\mathcal{O}(h^{m+1})$, where $h$ is a local density of sample parameter (i.e., the fill distance) and $m$ is the degree of a local polynomial approximation, used in our algorithm. In addition, the proposed algorithm has linear time complexity with respect to the ambient-space's dimension. Thus, we are able to avoid the computational complexity, commonly encountered in high dimensional approximations, without having to perform non-linear dimension reduction, which inevitably introduces distortions to the geometry of the data. Additionally, we show in numerical experiments that the proposed approach compares favorably to statistical approaches for regression over manifolds and show its potential.
Submitted 16 January, 2020; v1 submitted 2 November, 2017;
originally announced November 2017.
-
Similarity Search Over Graphs Using Localized Spectral Analysis
Authors:
Yariv Aizenbud,
Amir Averbuch,
Gil Shabat,
Guy Ziv
Abstract:
This paper provides a new similarity detection algorithm. Given an input set of multi-dimensional data points and an additional reference data point for similarity finding, the algorithm uses a kernel method that embeds the data points into a low dimensional manifold. Unlike other kernel methods, which consider the entire data for the embedding, our method selects a specific set of kernel eigenvectors. The eigenvectors are chosen to separate between the data points and the reference data point so that similar data points can be easily identified as being distinct from most of the members in the dataset.
Submitted 11 July, 2017;
originally announced July 2017.
-
A max-cut approach to heterogeneity in cryo-electron microscopy
Authors:
Yariv Aizenbud,
Yoel Shkolnisky
Abstract:
The field of cryo-electron microscopy has made astounding advancements in the past few years, mainly due to advancements in electron detectors' technology. Yet, one of the key open challenges of the field remains the processing of heterogeneous data sets, produced from samples containing particles at several different conformational states. For such data sets, the algorithms must include some classification procedure to identify homogeneous groups within the data, so that the images in each group correspond to the same underlying structure. The fundamental importance of the heterogeneity problem in cryo-electron microscopy has drawn many research efforts, and resulted in significant progress in classification algorithms for heterogeneous data sets. While these algorithms are extremely useful and effective in practice, they lack rigorous mathematical analysis and performance guarantees.
In this paper, we attempt to make the first steps towards rigorous mathematical analysis of the heterogeneity problem in cryo-electron microscopy. To that end, we present an algorithm for processing heterogeneous data sets, and prove accuracy and stability bounds for it. We also suggest an extension of this algorithm that combines the classification and reconstruction steps. We demonstrate it on simulated data, and compare its performance to the state-of-the-art algorithm in RELION.
Submitted 3 October, 2019; v1 submitted 5 September, 2016;
originally announced September 2016.
-
Multi-View Kernel Consensus For Data Analysis
Authors:
Moshe Salhov,
Ofir Lindenbaum,
Yariv Aizenbud,
Avi Silberschatz,
Yoel Shkolnisky,
Amir Averbuch
Abstract:
The input feature set for many data-driven tasks is high-dimensional while the intrinsic dimension of the data is low. Data analysis methods aim to uncover the underlying low dimensional structure imposed by the low dimensional hidden parameters by utilizing distance metrics that consider the set of attributes as a single monolithic set. However, the transformation of the low dimensional phenomena into the measured high dimensional observations might distort the distance metric. This distortion can affect the desired estimated low dimensional geometric structure. In this paper, we suggest utilizing the redundancy in the attribute domain by partitioning the attributes into multiple subsets we call views. The proposed methods utilize the agreement, also called consensus, between different views to extract valuable geometric information that unifies multiple views about the intrinsic relationships among several different observations. This unification enhances the information that a single view or a simple concatenation of views provides.
Submitted 29 January, 2019; v1 submitted 28 June, 2016;
originally announced June 2016.