Showing 1–2 of 2 results for author: Ulzii-Orshikh, N

  1. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  2. arXiv:2010.12611  [pdf, other]

    cs.SI

    Information access representations and social capital in networks

    Authors: Ashkan Bashardoust, Hannah C. Beilinson, Sorelle A. Friedler, Jiajie Ma, Jade Rousseau, Carlos E. Scheidegger, Blair D. Sullivan, Nasanbayar Ulzii-Orshikh, Suresh Venkatasubramanian

    Abstract: Social network position confers power and social capital. In the setting of online social networks that have massive reach, creating mathematical representations of social capital is an important step towards understanding how network position can differentially confer advantage to different groups and how network position can itself be a source of advantage. In this paper, we use well established…

    Submitted 16 October, 2023; v1 submitted 23 October, 2020; originally announced October 2020.