Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Authors:
Julia Kreutzer,
Isaac Caswell,
Lisa Wang,
Ahsan Wahab,
Daan van Esch,
Nasanbayar Ulzii-Orshikh,
Allahsera Tapo,
Nishant Subramani,
Artem Sokolov,
Claytone Sikasote,
Monang Setyawan,
Supheakmungkol Sarin,
Sokhar Samb,
Benoît Sagot,
Clara Rivera,
Annette Rios,
Isabel Papadimitriou,
Salomey Osei,
Pedro Ortiz Suarez,
Iroro Orife,
Kelechi Ogueji,
Andre Niyongabo Rubungo,
Toan Q. Nguyen,
Mathias Müller,
André Müller
, et al. (27 additional authors not shown)
Abstract:
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…
▽ More
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
△ Less
Submitted 21 February, 2022; v1 submitted 22 March, 2021;
originally announced March 2021.
Information access representations and social capital in networks
Authors:
Ashkan Bashardoust,
Hannah C. Beilinson,
Sorelle A. Friedler,
Jiajie Ma,
Jade Rousseau,
Carlos E. Scheidegger,
Blair D. Sullivan,
Nasanbayar Ulzii-Orshikh,
Suresh Venkatasubramanian
Abstract:
Social network position confers power and social capital. In the setting of online social networks that have massive reach, creating mathematical representations of social capital is an important step towards understanding how network position can differentially confer advantage to different groups and how network position can itself be a source of advantage. In this paper, we use well established…
▽ More
Social network position confers power and social capital. In the setting of online social networks that have massive reach, creating mathematical representations of social capital is an important step towards understanding how network position can differentially confer advantage to different groups and how network position can itself be a source of advantage. In this paper, we use well established models for information flow on networks as a base to propose a formal descriptor of the network position of a node as represented by its information access. Combining these descriptors allows a full representation of social capital across the network. Using real-world networks, we demonstrate that this representation allows the identification of differences between groups based on network specific measures of inequality of access.
△ Less
Submitted 16 October, 2023; v1 submitted 23 October, 2020;
originally announced October 2020.