-
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Authors:
Sebastian Ruder,
Jonathan H. Clark,
Alexander Gutkin,
Mihir Kale,
Min Ma,
Massimo Nicosia,
Shruti Rijhwani,
Parker Riley,
Jean-Michel A. Sarr,
Xinyi Wang,
John Wieting,
Nitish Gupta,
Anna Katanova,
Christo Kirov,
Dana L. Dickinson,
Brian Roark,
Bidisha Samanta,
Connie Tao,
David I. Adelani,
Vera Axelrod,
Isaac Caswell,
Colin Cherry,
Dan Garrette,
Reeve Ingle,
Melvin Johnson,
et al. (2 additional authors not shown)
Abstract:
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
Submitted 24 May, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
-
Weighting vectors for machine learning: numerical harmonic analysis applied to boundary detection
Authors:
Eric Bunch,
Jeffery Kline,
Daniel Dickinson,
Suhaas Bhat,
Glenn Fung
Abstract:
Metric space magnitude, an active field of research in algebraic topology, is a scalar quantity that summarizes the effective number of distinct points that live in a general metric space. The {\em weighting vector} is a closely related concept that captures, in a nontrivial way, much of the underlying geometry of the original metric space. Recent work has demonstrated that when the metric space is Euclidean, the weighting vector serves as an effective tool for boundary detection. We recast this result and show that the weighting vector may be viewed as a solution to a kernelized SVM. As one consequence, we apply this new insight to the task of outlier detection, and we demonstrate performance that is competitive with or exceeds that of state-of-the-art techniques on benchmark data sets. Under mild assumptions, we show that the weighting vector, which has the computational cost of matrix inversion, can be efficiently approximated in linear time. We show how nearest neighbor methods can approximate solutions to the minimization problems defined by SVMs.
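To make the weighting vector in the abstract above concrete, here is a minimal sketch. It assumes the standard definition for a finite metric space: build the similarity matrix Z with entries exp(-t * d(x_i, x_j)), solve Z w = 1 for the weighting vector w, and take the sum of w as the magnitude. The function names and the `scale` parameter are hypothetical choices made for illustration, not the authors' code. The sketch uses a direct linear solve (the matrix-inversion cost the abstract mentions) and does not attempt the linear-time approximation.

```python
import numpy as np

def weighting_vector(points, scale=1.0):
    """Solve Z w = 1 with Z_ij = exp(-scale * ||x_i - x_j||).

    `points` is an (n, d) array of distinct Euclidean points.
    `scale` is an illustrative parameter that rescales all distances.
    """
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via broadcasting: result has shape (n, n).
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    similarity = np.exp(-scale * dists)
    # Direct dense solve, cubic in n; no approximation is attempted here.
    return np.linalg.solve(similarity, np.ones(len(pts)))

def magnitude(points, scale=1.0):
    """Magnitude of the finite metric space: the sum of the weights."""
    return float(weighting_vector(points, scale).sum())

# Eleven equally spaced points on a line: the two endpoints (the "boundary")
# should receive larger weights than the interior points.
line = np.arange(11, dtype=float).reshape(-1, 1)
weights = weighting_vector(line)
```

On the equally spaced line, the endpoint weights exceed the interior weights. This is a small instance of the boundary-detection behaviour the abstract describes for Euclidean spaces.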
Submitted 1 June, 2021;
originally announced June 2021.
-
Practical applications of metric space magnitude and weighting vectors
Authors:
Eric Bunch,
Daniel Dickinson,
Jeffery Kline,
Glenn Fung
Abstract:
Metric space magnitude, an active subject of research in algebraic topology, originally arose in the context of biology, where it was used to represent the effective number of distinct species in an environment. In a more general setting, the magnitude of a metric space is a real number that aims to quantify the effective number of distinct points in the space. The contribution of each point to a metric space's global magnitude, which is encoded by the {\em weighting vector}, captures much of the underlying geometry of the original metric space.
Surprisingly, when the metric space is Euclidean, the weighting vector also serves as an effective tool for boundary detection. This allows the weighting vector to serve as the foundation of novel algorithms for classic machine learning tasks such as classification, outlier detection and active learning. We demonstrate, using experiments and comparisons on classic benchmark datasets, the promise of the proposed magnitude and weighting vector-based approaches.
Submitted 2 July, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.