Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility

Jha, Ashish; Leplat, Valentin; Phan, AH

Abstract: Selecting a small yet useful subset of training data is hard because signals of example utility (uncertainty, rarity, diversity, etc.) are heterogeneous and typically combined with ad hoc weights. We propose a market-based selector that prices each example via a cost-function prediction market, the logarithmic market scoring rule (LMSR): signals act as traders, a single liquidity parameter controls concentration, and topic-wise normalization stabilizes calibration. Token budgets are handled explicitly by a price-per-token rule $\rho=p/\ell^{\gamma}$, where $p$ is an example's price, $\ell$ its token length, and $\gamma$ exposes an interpretable length bias; a lightweight diversity head improves coverage. We quantify coverage via topic cluster coverage and effective sample size. On the theory side, we show that LMSR implements a maximum-entropy aggregation with exponential weighting and a convex objective, yielding transparent knobs for aggregation strength. Empirically, on GSM8K (60k-token budget) the market with diversity achieves parity with strong single-signal baselines while reducing seed variance and incurring $<\!0.1$ GPU-hr of selection overhead; on AGNews, with 5-25\% of the data kept, the market (with light balancing) delivers competitive accuracy with improved balance and stability. The framework unifies multi-signal data curation under fixed compute for prompt-level reasoning and classification.
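
The pricing and budget rules named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each example has already been given one aggregated score (for instance, a sum of the shares bought by the signal "traders"), and uses the standard LMSR fact that prices are a softmax of those scores divided by the liquidity parameter b. The function names, the greedy fill strategy, and the Kish-style effective sample size are all assumptions made for illustration; the paper may define these differently.

```python
import math

def lmsr_prices(scores, b=1.0):
    """LMSR prices: softmax of per-example scores scaled by liquidity b.

    `scores` are assumed (hypothetically) to be aggregated trader positions.
    Smaller b concentrates price mass; larger b flattens it.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / b) for q in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_by_price_per_token(prices, lengths, budget, gamma=1.0):
    """Greedily pick examples by rho = p / l**gamma under a token budget.

    gamma = 0 ignores length; larger gamma favors shorter examples.
    The greedy fill is an illustrative assumption, not the paper's procedure.
    """
    order = sorted(
        range(len(prices)),
        key=lambda i: prices[i] / lengths[i] ** gamma,
        reverse=True,
    )
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return chosen, used

def effective_sample_size(prices):
    """Kish-style effective sample size for normalized weights: 1 / sum(p^2)."""
    return 1.0 / sum(p * p for p in prices)
```

In this sketch, lowering b sharpens the price distribution and therefore lowers the effective sample size, while gamma trades off price against length when the token budget is filled.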

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Cite as:	arXiv:2510.02456 [cs.LG]
	(or arXiv:2510.02456v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.02456

