Showing 1–25 of 25 results for author: Orlitsky, A

  1. arXiv:2309.01973  [pdf, other]

    cs.LG cs.AI cs.IT stat.ML

    Linear Regression using Heterogeneous Data Batches

    Authors: Ayush Jain, Rajat Sen, Weihao Kong, Abhimanyu Das, Alon Orlitsky

    Abstract: In many learning applications, data are collected from multiple sources, each providing a \emph{batch} of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall in one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and import…

    Submitted 5 September, 2023; originally announced September 2023.

  2. arXiv:2202.07172  [pdf, other]

    stat.ML cs.LG math.ST

    TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm

    Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

    Abstract: Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it c…

    Submitted 17 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: 19 pages, 12 figures

  3. arXiv:2202.05453  [pdf, ps, other]

    cs.LG stat.ML

    Robust estimation algorithms don't need to know the corruption level

    Authors: Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

    Abstract: Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This bri…

    Submitted 11 February, 2022; originally announced February 2022.

  4. arXiv:2010.00064  [pdf, other]

    cs.LG cs.IT math.ST stat.ML

    Linear-Sample Learning of Low-Rank Distributions

    Authors: Ayush Jain, Alon Orlitsky

    Abstract: Many latent-variable applications, including community detection, collaborative filtering, genomic analysis, and NLP, model data as generated by low-rank matrices. Yet despite considerable research, except for very special cases, the number of samples required to efficiently recover the underlying matrices has not been known. We determine the onset of learning in several common latent-variable set…

    Submitted 30 September, 2020; originally announced October 2020.

    Comments: Accepted for NeurIPS 2020

  5. arXiv:2002.11665  [pdf, ps, other]

    stat.ML cs.IT cs.LG math.ST

    Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions

    Authors: Yi Hao, Alon Orlitsky

    Abstract: The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetr…

    Submitted 26 February, 2020; originally announced February 2020.

    Comments: 56 pages

  6. arXiv:2002.11099  [pdf, ps, other]

    stat.ML cs.IT cs.LG math.ST

    A General Method for Robust Learning from Batches

    Authors: Ayush Jain, Alon Orlitsky

    Abstract: In many applications, data is collected in batches, some of which are corrupt or even adversarial. Recent work derived optimal robust algorithms for estimating discrete distributions in this setting. We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains. Building on…

    Submitted 25 February, 2020; originally announced February 2020.

    Comments: First Draft

  7. arXiv:2002.09589  [pdf, other]

    stat.ML cs.IT cs.LG math.ST

    SURF: A Simple, Universal, Robust, Fast Distribution Learning Algorithm

    Authors: Yi Hao, Ayush Jain, Alon Orlitsky, Vaishakh Ravindrakumar

    Abstract: Sample- and computationally-efficient distribution estimation is a fundamental tenet in statistics and machine learning. We present SURF, an algorithm for approximating distributions by piecewise polynomials. SURF is: simple, replacing prior complex optimization techniques by straightforward empirical probability approximation of each potential polynomial piece through simple empirical-probabi…

    Submitted 11 February, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: 27 pages, 9 figures, 3 tables

  8. arXiv:1911.08532  [pdf, other]

    cs.LG stat.ML

    Optimal Robust Learning of Discrete Distributions from Batches

    Authors: Ayush Jain, Alon Orlitsky

    Abstract: Many applications, including natural language processing, sensor networks, collaborative filtering, and federated learning, call for estimating discrete distributions from data collected in batches, some of which may be untrustworthy, erroneous, faulty, or even adversarial. Previous estimators for this setting ran in exponential time, and for some regimes required a suboptimal number of batches.…

    Submitted 24 February, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

    Comments: Added experiments, minor improvement in results

  9. arXiv:1911.03105  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Unified Sample-Optimal Property Estimation in Near-Linear Time

    Authors: Yi Hao, Alon Orlitsky

    Abstract: We consider the fundamental learning problem of estimating properties of distributions over large domains. Using a novel piecewise-polynomial approximation technique, we derive the first unified methodology for constructing sample- and time-efficient estimators for all sufficiently smooth, symmetric and non-symmetric, additive properties. This technique yields near-linear-time computable estimator…

    Submitted 17 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

    Comments: Appeared at NeurIPS 2019. Fixed a few typos and minor issues in corner cases

  10. arXiv:1906.03794  [pdf, other]

    stat.ML cs.LG math.ST

    The Broad Optimality of Profile Maximum Likelihood

    Authors: Yi Hao, Alon Orlitsky

    Abstract: We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$: $\textbf{Distribution estimation}$ Under $\ell_1$ dis…

    Submitted 11 July, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

    Comments: Added a new section (Section 8) about truncated PML (TPML) and derived several new results

  11. arXiv:1904.00070  [pdf, other]

    stat.ML cs.LG math.ST

    Data Amplification: A Unified and Competitive Approach to Property Estimation

    Authors: Yi Hao, Alon Orlitsky, Ananda T. Suresh, Yihong Wu

    Abstract: Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribut…

    Submitted 29 March, 2019; originally announced April 2019.

    Comments: In NeurIPS 2018

  12. arXiv:1903.01432  [pdf, other]

    math.ST cs.LG stat.ML

    Data Amplification: Instance-Optimal Property Estimation

    Authors: Yi Hao, Alon Orlitsky

    Abstract: The best-known and most commonly used distribution-property estimation technique uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly "amplify" the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlyi…

    Submitted 5 March, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

    Comments: In this new version, we strengthened the previous results by eliminating unnecessary assumptions

  13. arXiv:1810.11754  [pdf, other]

    cs.LG stat.ML

    On Learning Markov Chains

    Authors: Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati

    Abstract: The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it has attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems rel…

    Submitted 27 October, 2018; originally announced October 2018.

    Comments: To appear at NIPS 2018

  14. arXiv:1705.05366  [pdf, ps, other]

    cs.LG

    Maximum Selection and Ranking under Noisy Comparisons

    Authors: Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, Ananda Theertha Suresh

    Abstract: We consider $(ε,δ)$-PAC maximum-selection and ranking for general probabilistic models whose comparison probabilities satisfy strong stochastic transitivity and stochastic triangle inequality. Modifying the popular knockout tournament, we propose a maximum-selection algorithm that uses $\mathcal{O}\left(\frac{n}{ε^2}\log \frac{1}δ\right)$ comparisons, a number tight up to a constant factor. We th…

    Submitted 15 May, 2017; originally announced May 2017.

  15. arXiv:1611.02960  [pdf, other]

    cs.IT cs.DS cs.LG

    A Unified Maximum Likelihood Approach for Optimal Distribution Property Estimation

    Authors: Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Ananda Theertha Suresh

    Abstract: The advent of data science has spurred interest in estimating properties of distributions over large alphabets. Fundamental symmetric properties such as support size, support coverage, entropy, and proximity to uniformity have received the most attention, with each property estimated using a different technique and often intricate analysis tools. We prove that for all these properties, a single, simple,…

    Submitted 28 November, 2016; v1 submitted 9 November, 2016; originally announced November 2016.

  16. arXiv:1606.02786  [pdf, ps, other]

    cs.DS cs.IT

    Maximum Selection and Sorting with Adversarial Comparators and an Application to Density Estimation

    Authors: Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh

    Abstract: We study maximum selection and sorting of $n$ numbers using pairwise comparators that output the larger of their two inputs if the inputs are more than a given threshold apart, and output an adversarially-chosen input otherwise. We consider two adversarial models: a non-adaptive adversary that decides on the outcomes in advance based solely on the inputs, and an adaptive adversary that can decide…

    Submitted 8 June, 2016; originally announced June 2016.

  17. arXiv:1504.08070  [pdf, ps, other]

    cs.IT math.ST

    Universal Compression of Power-Law Distributions

    Authors: Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapati, Ananda Theertha Suresh

    Abstract: English words and the outputs of many other natural processes are well known to follow a Zipf distribution. Yet this thoroughly established property has never been shown to help compress or predict these important processes. We show that the expected redundancy of Zipf distributions of order $α>1$ is roughly the $1/α$ power of the expected redundancy of unrestricted distributions. Hence for these…

    Submitted 30 April, 2015; v1 submitted 29 April, 2015; originally announced April 2015.

    Comments: 20 pages

  18. arXiv:1504.04103  [pdf, ps, other]

    cs.DS cs.CC cs.LG math.ST

    Faster Algorithms for Testing under Conditional Sampling

    Authors: Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapathi, Ananda Theertha Suresh

    Abstract: There has been considerable recent interest in distribution-tests whose run-time and sample requirements are sublinear in the domain-size $k$. We study two of the most important tests under the conditional-sampling model where each query specifies a subset $S$ of the domain, and the response is a sample drawn from $S$ according to the underlying distribution. For identity testing, which asks whe…

    Submitted 16 April, 2015; originally announced April 2015.

    Comments: 31 pages

  19. arXiv:1503.07940  [pdf, other]

    cs.IT cs.DS cs.LG math.ST

    Competitive Distribution Estimation

    Authors: Alon Orlitsky, Ananda Theertha Suresh

    Abstract: Estimating an unknown distribution from its samples is a fundamental problem in statistics. The common, min-max, formulation of this goal considers the performance of the best estimator over all distributions in a class. It shows that with $n$ samples, distributions over $k$ symbols can be learned to a KL divergence that decreases to zero with the sample size $n$, but grows unboundedly with the al…

    Submitted 26 March, 2015; originally announced March 2015.

    Comments: 15 pages

  20. arXiv:1410.4550  [pdf, ps, other]

    cs.IT

    Universal compression of Gaussian sources with unknown parameters

    Authors: A. Orlitsky, N. Santhanam

    Abstract: For a collection of distributions over a countable support set, the worst case universal compression formulation by Shtarkov attempts to assign a universal distribution over the support set. The formulation aims to ensure that the universal distribution does not underestimate the probability of any element in the support set relative to distributions in the collection. When the alphabet is uncount…

    Submitted 16 October, 2014; originally announced October 2014.

  21. arXiv:1408.1000  [pdf, other]

    cs.IT cs.DS cs.LG

    Estimating Renyi Entropy of Discrete Distributions

    Authors: Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, Himanshu Tyagi

    Abstract: It was recently shown that estimating the Shannon entropy $H({\rm p})$ of a discrete $k$-symbol distribution ${\rm p}$ requires $Θ(k/\log k)$ samples, a number that grows near-linearly in the support size. In many applications $H({\rm p})$ can be replaced by the more general Rényi entropy of order $α$, $H_α({\rm p})$. We determine the number of samples needed to estimate $H_α({\rm p})$ for all…

    Submitted 10 March, 2016; v1 submitted 2 August, 2014; originally announced August 2014.

  22. arXiv:1405.7460  [pdf, ps, other]

    cs.IT cs.LG

    Universal Compression of Envelope Classes: Tight Characterization via Poisson Sampling

    Authors: Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh

    Abstract: The Poisson-sampling technique eliminates dependencies among symbol appearances in a random sequence. It has been used to simplify the analysis and strengthen the performance guarantees of randomized algorithms. Applying this method to universal compression, we relate the redundancies of fixed-length and Poisson-sampled sequences, use the relation to derive a simple single-letter formula that appr…

    Submitted 29 May, 2014; originally announced May 2014.

  23. arXiv:1403.2439  [pdf, ps, other]

    cs.DM cs.DS cs.IT

    String Reconstruction from Substring Compositions

    Authors: Jayadev Acharya, Hirakendu Das, Olgica Milenkovic, Alon Orlitsky, Shengjun Pan

    Abstract: Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tig…

    Submitted 10 March, 2014; originally announced March 2014.

  24. arXiv:1402.4746  [pdf, ps, other]

    cs.LG cs.DS cs.IT stat.ML

    Near-optimal-sample estimators for spherical Gaussian mixtures

    Authors: Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh

    Abstract: Statistical and machine-learning algorithms are frequently applied to high-dimensional data. In many of these applications data is scarce, and often much more costly than computation time. We provide the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. For mixtures of any $k$ $d$-dimensional spherical Gaussians, we derive an intuitive spectral-es…

    Submitted 19 February, 2014; originally announced February 2014.

  25. arXiv:1207.4175  [pdf]

    cs.AI

    On Modeling Profiles instead of Values

    Authors: Alon Orlitsky, Narayana Santhanam, Krishnamurthy Viswanathan, Junan Zhang

    Abstract: We consider the problem of estimating the distribution underlying an observed sample of data. Instead of maximum likelihood, which maximizes the probability of the observed values, we propose a different estimate, the high-profile distribution, which maximizes the probability of the observed profile, the number of symbols appearing any given number of times. We determine the high-profile distribut…

    Submitted 11 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

    Report number: UAI-P-2004-PG-426-435