Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs > arXiv:2311.14247

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science > Data Structures and Algorithms

arXiv:2311.14247 (cs)
[Submitted on 24 Nov 2023]

Title:Distribution Testing with a Confused Collector

Authors:Renato Ferreira Pinto Jr., Nathaniel Harms
View a PDF of the paper titled Distribution Testing with a Confused Collector, by Renato Ferreira Pinto Jr. and Nathaniel Harms
View PDF
Abstract:We are interested in testing properties of distributions with systematically mislabeled samples. Our goal is to make decisions about unknown probability distributions, using a sample that has been collected by a confused collector, such as a machine-learning classifier that has not learned to distinguish all elements of the domain. The confused collector holds an unknown clustering of the domain and an input distribution $\mu$, and provides two oracles: a sample oracle which produces a sample from $\mu$ that has been labeled according to the clustering; and a label-query oracle which returns the label of a query point $x$ according to the clustering.
Our first set of results shows that identity, uniformity, and equivalence of distributions can be tested efficiently, under the earth-mover distance, with remarkably weak conditions on the confused collector, even when the unknown clustering is adversarial. This requires defining a variant of the distribution testing task (inspired by the recent testable learning framework of Rubinfeld & Vasilyan), where the algorithm should test a joint property of the distribution and its clustering. As an example, we get efficient testers when the distribution tester is allowed to reject if it detects that the confused collector clustering is "far" from being a decision tree.
The second set of results shows that we can sometimes do significantly better when the clustering is random instead of adversarial. For certain one-dimensional random clusterings, we show that uniformity can be tested under the TV distance using $\widetilde O\left(\frac{\sqrt n}{\rho^{3/2} \epsilon^2}\right)$ samples and zero queries, where $\rho \in (0,1]$ controls the "resolution" of the clustering. We improve this to $O\left(\frac{\sqrt n}{\rho \epsilon^2}\right)$ when queries are allowed.
Comments: 64 pages. Full version of paper to appear at ITCS 2024. arXiv admin note: text overlap with arXiv:2304.01374
Subjects: Data Structures and Algorithms (cs.DS)
Cite as: arXiv:2311.14247 [cs.DS]
  (or arXiv:2311.14247v1 [cs.DS] for this version)
  https://doi.org/10.48550/arXiv.2311.14247
arXiv-issued DOI via DataCite

Submission history

From: Renato Ferreira Pinto Jr. [view email]
[v1] Fri, 24 Nov 2023 01:39:45 UTC (68 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled Distribution Testing with a Confused Collector, by Renato Ferreira Pinto Jr. and Nathaniel Harms
  • View PDF
  • TeX Source
  • Other Formats
license icon view license
Current browse context:
cs.DS
< prev   |   next >
new | recent | 2023-11
Change to browse by:
cs

References & Citations

  • NASA ADS
  • Google Scholar
  • Semantic Scholar
a export BibTeX citation Loading...

BibTeX formatted citation

×
Data provided by:

Bookmark

BibSonomy logo Reddit logo

Bibliographic and Citation Tools

Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)

Code, Data and Media Associated with this Article

alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)

Demos

Replicate (What is Replicate?)
Hugging Face Spaces (What is Spaces?)
TXYZ.AI (What is TXYZ.AI?)

Recommenders and Search Tools

Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
  • Author
  • Venue
  • Institution
  • Topic

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack