-
Data Contamination Report from the 2024 CONDA Shared Task
Authors:
Oscar Sainz,
Iker García-Ferrero,
Alon Jacovi,
Jon Ander Campos,
Yanai Elazar,
Eneko Agirre,
Yoav Goldberg,
Wei-Lin Chen,
Jenny Chim,
Leshem Choshen,
Luca D'Amico-Wong,
Melissa Dell,
Run-Ze Fan,
Shahriar Golchin,
Yucheng Li,
Pengfei Liu,
Bhavish Pahwa,
Ameya Prabhu,
Suryansh Sharma,
Emily Silcock,
Kateryna Solonko,
David Stap,
Mihai Surdeanu,
Yu-Min Tseng,
Vishaal Udandarao
, et al. (3 additional authors not shown)
Abstract:
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in cur…
▽ More
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in current available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pool requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.
△ Less
Submitted 4 August, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Newswire: A Large-Scale Structured Database of a Century of Historical News
Authors:
Emily Silcock,
Abhishek Arora,
Luca D'Amico-Wong,
Melissa Dell
Abstract:
In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundred…
▽ More
In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers. The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. To construct the Newswire dataset, we first recognize newspaper layouts and transcribe around 138 millions structured article texts from raw image scans. We then use a customized neural bi-encoder model to de-duplicate reproduced articles, in the presence of considerable abridgement and noise, quantifying how widely each article was reproduced. A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain. The structured data that accompany the texts provide rich information about the who (disambiguated individuals), what (topics), and where (georeferencing) of the news that millions of Americans read over the course of a century. We also include Library of Congress metadata information about the newspapers that ran the articles on their front pages. The Newswire dataset is useful both for large language modeling - expanding training data beyond what is available from modern web texts - and for studying a diversity of questions in computational linguistics, social science, and the digital humanities.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Disrupting Bipartite Trading Networks: Matching for Revenue Maximization
Authors:
Luca D'Amico-Wong,
Yannai A. Gonczarowski,
Gary Qiurui Ma,
David C. Parkes
Abstract:
We model the role of an online platform disrupting a market with unit-demand buyers and unit-supply sellers. Each seller can transact with a subset of the buyers whom she already knows, as well as with any additional buyers to whom she is introduced by the platform. Given these constraints on trade, prices and transactions are induced by a competitive equilibrium. The platform's revenue is proport…
▽ More
We model the role of an online platform disrupting a market with unit-demand buyers and unit-supply sellers. Each seller can transact with a subset of the buyers whom she already knows, as well as with any additional buyers to whom she is introduced by the platform. Given these constraints on trade, prices and transactions are induced by a competitive equilibrium. The platform's revenue is proportional to the total price of all trades between platform-introduced buyers and sellers.
In general, we show that the platform's revenue-maximization problem is computationally intractable. We provide structural results for revenue-optimal matchings and isolate special cases in which the platform can efficiently compute them. Furthermore, in a market where the maximum increase in social welfare that the platform can create is $ΔW$, we prove that the platform can attain revenue $Ω(ΔW/\log(\min\{n,m\}))$, where $n$ and $m$ are the numbers of buyers and sellers, respectively. When $ΔW$ is large compared to welfare without the platform, this gives a polynomial-time algorithm that guarantees a logarithmic approximation of the optimal welfare as revenue. We also show that even when the platform optimizes for revenue, the social welfare is at least an $O(\log(\min\{n,m\}))$-approximation to the optimal welfare. Finally, we prove significantly stronger bounds for revenue and social welfare in homogeneous-goods markets.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Easy as ABCs: Unifying Boltzmann Q-Learning and Counterfactual Regret Minimization
Authors:
Luca D'Amico-Wong,
Hugh Zhang,
Marc Lanctot,
David C. Parkes
Abstract:
We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds algorithm combining Boltzmann Q-learning (BQL), a classic reinforcement learning algorithm for single-agent domains, and counterfactual regret minimization (CFR), a central algorithm for learning in multi-agent domains. ABCs adaptively chooses what fraction of the environment to explore each iteration by measuri…
▽ More
We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds algorithm combining Boltzmann Q-learning (BQL), a classic reinforcement learning algorithm for single-agent domains, and counterfactual regret minimization (CFR), a central algorithm for learning in multi-agent domains. ABCs adaptively chooses what fraction of the environment to explore each iteration by measuring the stationarity of the environment's reward and transition dynamics. In Markov decision processes, ABCs converges to the optimal policy with at most an O(A) factor slowdown compared to BQL, where A is the number of actions in the environment. In two-player zero-sum games, ABCs is guaranteed to converge to a Nash equilibrium (assuming access to a perfect oracle for detecting stationarity), while BQL has no such guarantees. Empirically, ABCs demonstrates strong performance when benchmarked across environments drawn from the OpenSpiel game library and OpenAI Gym and exceeds all prior methods in environments which are neither fully stationary nor fully nonstationary.
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Authors:
Melissa Dell,
Jacob Carlson,
Tom Bryan,
Emily Silcock,
Abhishek Arora,
Zejiang Shen,
Luca D'Amico-Wong,
Quan Le,
Pablo Querubin,
Leander Heldring
Abstract:
Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…
▽ More
Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
△ Less
Submitted 23 August, 2023;
originally announced August 2023.
-
Understanding and Mitigating Extrapolation Failures in Physics-Informed Neural Networks
Authors:
Lukas Fesser,
Luca D'Amico-Wong,
Richard Qiu
Abstract:
Physics-informed Neural Networks (PINNs) have recently gained popularity due to their effective approximation of partial differential equations (PDEs) using deep neural networks (DNNs). However, their out of domain behavior is not well understood, with previous work speculating that the presence of high frequency components in the solution function might be to blame for poor extrapolation performa…
▽ More
Physics-informed Neural Networks (PINNs) have recently gained popularity due to their effective approximation of partial differential equations (PDEs) using deep neural networks (DNNs). However, their out of domain behavior is not well understood, with previous work speculating that the presence of high frequency components in the solution function might be to blame for poor extrapolation performance. In this paper, we study the extrapolation behavior of PINNs on a representative set of PDEs of different types, including high-dimensional PDEs. We find that failure to extrapolate is not caused by high frequencies in the solution function, but rather by shifts in the support of the Fourier spectrum over time. We term these spectral shifts and quantify them by introducing a Weighted Wasserstein-Fourier distance (WWF). We show that the WWF can be used to predict PINN extrapolation performance, and that in the absence of significant spectral shifts, PINN predictions stay close to the true solution even in extrapolation. Finally, we propose a transfer learning-based strategy to mitigate the effects of larger spectral shifts, which decreases extrapolation errors by up to 82%.
△ Less
Submitted 26 November, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Noise-Robust De-Duplication at Scale
Authors:
Emily Silcock,
Luca D'Amico-Wong,
Jinglin Yang,
Melissa Dell
Abstract:
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evalu…
▽ More
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained models will facilitate further research and applications.
△ Less
Submitted 24 April, 2024; v1 submitted 9 October, 2022;
originally announced October 2022.