The study of short texts in digital politics: Document aggregation for topic modeling
Authors:
Nitheesha Nakka,
Omer F. Yalcin,
Bruce A. Desmarais,
Sarah Rajtmajer,
Burt Monroe
Abstract:
Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one…
▽ More
Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
Language-Agnostic Bias Detection in Language Models with Bias Probing
Authors:
Abdullatif Köksal,
Omer Faruk Yalcin,
Ahmet Akbiyik,
M. Tahir Kilavuz,
Anna Korhonen,
Hinrich Schütze
Abstract:
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nation…
▽ More
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nationality as a case study, we show that LABDet `surfaces' nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet's reliability and applicability to different templates and languages through an extensive set of robustness checks. We publicly share our code and dataset in https://github.com/akoksal/LABDet.
△ Less
Submitted 20 November, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.