Search | arXiv e-print repository

CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Authors: Francisco Valentini, Diego Kozlowski, Vincent Larivière

Abstract: Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is design… ▽ More Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers. △ Less

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2502.13934 [pdf]

Citation proximus: the role of social and semantic ties in citing behaviour

Authors: Diego Kozlowski, Carolina Pradier, Pierre Benz, Natsumi Shokida, Jens Peter Andersen, Vincent Larivière

Abstract: Citations are a key indicator of research impact but are shaped by factors beyond intrinsic research quality, including prestige, social networks, and thematic similarity. While the Matthew Effect explains how prestige accumulates and influences citation distributions, our study contextualizes this by showing that other mechanisms also play a crucial role. Analyzing a large dataset of disambiguate… ▽ More Citations are a key indicator of research impact but are shaped by factors beyond intrinsic research quality, including prestige, social networks, and thematic similarity. While the Matthew Effect explains how prestige accumulates and influences citation distributions, our study contextualizes this by showing that other mechanisms also play a crucial role. Analyzing a large dataset of disambiguated authors (N=43,467) and citation linkages (N=264,436) in U.S. economics, we find that close ties in the collaboration network are the strongest predictor of citation, closely followed by thematic similarity between papers. This reinforces the idea that citations are not only a matter of prestige but mostly of social networks and intellectual proximity. Prestige remains important for understanding highly cited papers, but for the majority of citations, proximity--both social and semantic--plays a more significant role. These findings shift attention from extreme cases of highly cited research toward the broader distribution of citations, which shapes career trajectories and the production of knowledge. Recognizing the diverse factors influencing citations is critical for science policy, as this work highlights inequalities that are not based on preferential attachment, but on the role of self-citations, collaborations, and mainstream versus no mainstream research subjects. △ Less

Submitted 19 February, 2025; originally announced February 2025.

arXiv:2502.03627 [pdf, other]

Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database

Authors: Maxime Holmberg Sainte-Marie, Diego Kozlowski, Lucía Céspedes, Vincent Larivière

Abstract: This project aims to compare various language classification procedures, procedures combining various Python language detection algorithms and metadata-based corpora extracted from manually-annotated articles sampled from the OpenAlex database. Following an analysis of precision and recall performance for each algorithm, corpus, and language as well as of processing speeds recorded for each algori… ▽ More This project aims to compare various language classification procedures, procedures combining various Python language detection algorithms and metadata-based corpora extracted from manually-annotated articles sampled from the OpenAlex database. Following an analysis of precision and recall performance for each algorithm, corpus, and language as well as of processing speeds recorded for each algorithm and corpus type, overall procedure performance at the database level was simulated using probabilistic confusion matrices for each algorithm, corpus, and language as well as a probabilistic model of relative article language frequencies for the whole OpenAlex database. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on the greedy corpus gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, the procedure combining the FastSpell algorithm and the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual, large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic, bibliometric-based research and analysis. △ Less

Submitted 18 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

Comments: 33 pages, 4 figures

arXiv:2411.18306 [pdf]

Delineating Feminist Studies through bibliometric analysis

Authors: Natsumi S. Shokida, Diego Kozlowski, Vincent Larivière

Abstract: The multidisciplinary and socially anchored nature of Feminist Studies presents unique challenges for bibliometric analysis, as this research area transcends traditional disciplinary boundaries and reflects discussions from feminist and LGBTQIA+ social movements. This paper proposes a novel approach for identifying gender/sex related publications scattered across diverse scientific disciplines. Us… ▽ More The multidisciplinary and socially anchored nature of Feminist Studies presents unique challenges for bibliometric analysis, as this research area transcends traditional disciplinary boundaries and reflects discussions from feminist and LGBTQIA+ social movements. This paper proposes a novel approach for identifying gender/sex related publications scattered across diverse scientific disciplines. Using the Dimensions database, we employ bibliometric techniques, natural language processing (NLP) and manual curation to compile a dataset of scientific publications that allows for the analysis of Gender Studies and its influence across different disciplines. This is achieved through a methodology that combines a core of specialized journals with a comprehensive keyword search over titles. These keywords are obtained by applying Topic Modeling (BERTopic) to the corpus of titles and abstracts from the core. This methodological strategy, divided into two stages, reflects the dynamic interaction between Gender Studies and its dialogue with different disciplines. This hybrid system surpasses basic keyword search by mitigating potential biases introduced through manual keyword enumeration. The resulting dataset comprises over 1.9 million scientific documents published between 1668 and 2023, spanning four languages. This dataset enables a characterization of Gender Studies in terms of addressed topics, citation and collaboration dynamics, and institutional and regional participation. By addressing the methodological challenges of studying "more-than-disciplinary" research areas, this approach could also be adapted to delineate other conversations where disciplinary boundaries are difficult to disentangle. △ Less

Submitted 27 November, 2024; originally announced November 2024.

Comments: 2 tables, 5 figures

arXiv:2409.10633 [pdf]

Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness

Authors: Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière

Abstract: Clarivate's Web of Science (WoS) and Elsevier's Scopus have been for decades the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research i… ▽ More Clarivate's Web of Science (WoS) and Elsevier's Scopus have been for decades the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research information. While already in use by scholars and research institutions, the quality of its metadata is currently being assessed. This paper contributes to this literature by assessing the completeness and accuracy of OpenAlex's metadata related to language, through a comparison with WoS, as well as an in-depth manual validation of a sample of 6,836 articles. Results show that OpenAlex exhibits a far more balanced linguistic coverage than WoS. However, language metadata is not always accurate, which leads OpenAlex to overestimate the place of English while underestimating that of other languages. If used critically, OpenAlex can provide comprehensive and representative analyses of languages used for scholarly publishing. However, more work is needed at infrastructural level to ensure the quality of metadata on language. △ Less

Submitted 19 September, 2024; v1 submitted 16 September, 2024; originally announced September 2024.

Comments: Reference list updated and corrected, corresponding author's email contact added, minor typographical errors corrected

arXiv:2408.07003 [pdf]

Generative AI for automatic topic labelling

Authors: Diego Kozlowski, Carolina Pradier, Pierre Benz

Abstract: Topic Modeling has become a prominent tool for the study of scientific fields, as they allow for a large scale interpretation of research trends. Nevertheless, the output of these models is structured as a list of keywords which requires a manual interpretation for the labelling. This paper proposes to assess the reliability of three LLMs, namely flan, GPT-4o, and GPT-4 mini for topic labelling. D… ▽ More Topic Modeling has become a prominent tool for the study of scientific fields, as they allow for a large scale interpretation of research trends. Nevertheless, the output of these models is structured as a list of keywords which requires a manual interpretation for the labelling. This paper proposes to assess the reliability of three LLMs, namely flan, GPT-4o, and GPT-4 mini for topic labelling. Drawing on previous research leveraging BERTopic, we generate topics from a dataset of all the scientific articles (n=34,797) authored by all biology professors in Switzerland (n=465) between 2008 and 2020, as recorded in the Web of Science database. We assess the output of the three models both quantitatively and qualitatively and find that, first, both GPT models are capable of accurately and precisely label topics from the models' output keywords. Second, 3-word labels are preferable to grasp the complexity of research topics. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: 10 pages, 1 figure

arXiv:2407.18783 [pdf]

Science for whom? The influence of the regional academic circuit on gender inequalities in Latin America

Authors: Carolina Pradier, Diego Kozlowski, Natsumi S. Shokida, Vincent Larivière

Abstract: The Latin-American scientific community has achieved significant progress towards gender parity, with nearly equal representation of women and men scientists. Nevertheless, women continue to be underrepresented in scholarly communication. Throughout the 20th century, Latin America established its academic circuit, focusing on research topics of regional significance. Through an analysis of scienti… ▽ More The Latin-American scientific community has achieved significant progress towards gender parity, with nearly equal representation of women and men scientists. Nevertheless, women continue to be underrepresented in scholarly communication. Throughout the 20th century, Latin America established its academic circuit, focusing on research topics of regional significance. Through an analysis of scientific publications, this article explores the relationship between gender inequalities in science and the integration of Latin-American researchers into the regional and global academic circuits between 1993 and 2022. We find that women are more likely to engage in the regional circuit, while men are more active within the global circuit. This trend is attributed to a thematic alignment between women's research interests and issues specific to Latin America. Furthermore, our results reveal that the mechanisms contributing to gender differences in symbolic capital accumulation vary between circuits. Women's work achieves equal or greater recognition compared to men's within the regional circuit, but generally garners less attention in the global circuit. Our findings suggest that policies aimed at strengthening the regional academic circuit would encourage scientists to address locally relevant topics while simultaneously fostering gender equality in science. △ Less

Submitted 28 November, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

arXiv:2402.04391 [pdf]

The Howard-Harvard effect: Institutional reproduction of intersectional inequalities

Authors: Diego Kozlowski, Thema Monroe-White, Vincent Larivière, Cassidy R. Sugimoto

Abstract: The US higher education system concentrates the production of science and scientists within a few institutions. This has implications for minoritized scholars and the topics with which they are disproportionately associated. This paper examines topical alignment between institutions and authors of varying intersectional identities, and the relationship with prestige and scientific impact. We obser… ▽ More The US higher education system concentrates the production of science and scientists within a few institutions. This has implications for minoritized scholars and the topics with which they are disproportionately associated. This paper examines topical alignment between institutions and authors of varying intersectional identities, and the relationship with prestige and scientific impact. We observe a Howard-Harvard effect, in which the topical profile of minoritized scholars are amplified in mission-driven institutions and decreased in prestigious institutions. Results demonstrate a consistent pattern of inequality in topics and research impact. Specifically, we observe statistically significant differences between minoritized scholars and White men in citations and journal impact. The aggregate research profile of prestigious US universities is highly correlated with the research profile of White men, and highly negatively correlated with the research profile of minoritized women. Furthermore, authors affiliated with more prestigious institutions are associated with increasing inequalities in both citations and journal impact. Academic institutions and funders are called to create policies to mitigate the systemic barriers that prevent the United States from achieving a fully robust scientific ecosystem. △ Less

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2104.12553 [pdf]

doi 10.1371/journal.pone.0264270

Avoiding bias when inferring race using name-based approaches

Authors: Diego Kozlowski, Dakota S. Murray, Alexis Bell, Will Hulsey, Vincent Larivière, Thema Monroe-White, Cassidy R. Sugimoto

Abstract: Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of racial based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about aut… ▽ More Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of racial based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about authors, such as their names, to infer their perceived race. As with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. The goal of this article is to assess the extent to which algorithmic bias is introduced using different approaches for name based racial inference. We use information from the U.S. Census and mortgage applications to infer the race of U.S. affiliated authors in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article lays the foundation for more systematic and less biased investigations into racial disparities in science. △ Less

Submitted 12 October, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

Journal ref: PLOS ONE 17(3) e0264270 (2022)

arXiv:2011.12096 [pdf]

doi 10.1080/14680777.2022.2047090

Gender bias in magazines oriented to men and women: a computational approach

Authors: Diego Kozlowski, Gabriela Lozano, Carla M. Felcher, Fernando Gonzalez, Edgar Altszyler

Abstract: Cultural products are a source to acquire individual values and behaviours. Therefore, the differences in the content of the magazines aimed specifically at women or men are a means to create and reproduce gender stereotypes. In this study, we compare the content of a women-oriented magazine with that of a men-oriented one, both produced by the same editorial group, over a decade (2008-2018). With… ▽ More Cultural products are a source to acquire individual values and behaviours. Therefore, the differences in the content of the magazines aimed specifically at women or men are a means to create and reproduce gender stereotypes. In this study, we compare the content of a women-oriented magazine with that of a men-oriented one, both produced by the same editorial group, over a decade (2008-2018). With Topic Modelling techniques we identify the main themes discussed in the magazines and quantify how much the presence of these topics differs between magazines over time. Then, we performed a word-frequency analysis to validate this methodology and extend the analysis to other subjects that did not emerge automatically. Our results show that the frequency of appearance of the topics Family, Business and Women as sex objects, present an initial bias that tends to disappear over time. Conversely, in Fashion and Science topics, the initial differences between both magazines are maintained. Besides, we show that in 2012, the content associated with horoscope increased in the women-oriented magazine, generating a new gap that remained open over time. Also, we show a strong increase in the use of words associated with feminism since 2015 and specifically the word abortion in 2018. Overall, these computational tools allowed us to analyse more than 24,000 articles. Up to our knowledge, this is the first study to compare magazines in such a large dataset, a task that would have been prohibitive using manual content analysis methodologies. △ Less

Submitted 24 November, 2020; originally announced November 2020.

Journal ref: Feminist Media Studies (2022)

arXiv:2011.02887 [pdf, other]

doi 10.1007/s11192-021-03984-1

Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation

Authors: Diego Kozlowski, Jennifer Dusdal, Jun Pang, Andreas Zilian

Abstract: Over the last century, we observe a steady and exponentially growth of scientific publications globally. The overwhelming amount of available literature makes a holistic analysis of the research within a field and between fields based on manual inspection impossible. Automatic techniques to support the process of literature review are required to find the epistemic and social patterns that are emb… ▽ More Over the last century, we observe a steady and exponentially growth of scientific publications globally. The overwhelming amount of available literature makes a holistic analysis of the research within a field and between fields based on manual inspection impossible. Automatic techniques to support the process of literature review are required to find the epistemic and social patterns that are embedded in scientific publications. In computer sciences, new tools have been developed to deal with large volumes of data. In particular, deep learning techniques open the possibility of automated end-to-end models to project observations to a new, low-dimensional space where the most relevant information of each observation is highlighted. Using deep learning to build new representations of scientific publications is a growing but still emerging field of research. The aim of this paper is to discuss the potential and limits of deep learning for gathering insights about scientific research articles. We focus on document-level embeddings based on the semantic and relational aspects of articles, using Natural Language Processing (NLP) and Graph Neural Networks (GNNs). We explore the different outcomes generated by those techniques. Our results show that using NLP we can encode a semantic space of articles, while with GNN we are able to build a relational space where the social practices of a research community are also encoded. △ Less

Submitted 5 November, 2020; originally announced November 2020.

Journal ref: Scientometrics 126, 2021

arXiv:2009.07727 [pdf, other]

doi 10.1371/journal.pone.0245393

Latent Dirichlet Allocation Models for World Trade Analysis

Authors: Diego Kozlowski, Viktoriya Semeshenko, Andrea Molinari

Abstract: The international trade is one of the classic areas of study in economics. Nowadays, given the availability of data, the tools used for the analysis can be complemented and enriched with new methodologies and techniques that go beyond the traditional approach. The present paper shows the application of the Latent Dirichlet Allocation Models, a well known technique from the area of Natural Language… ▽ More The international trade is one of the classic areas of study in economics. Nowadays, given the availability of data, the tools used for the analysis can be complemented and enriched with new methodologies and techniques that go beyond the traditional approach. The present paper shows the application of the Latent Dirichlet Allocation Models, a well known technique from the area of Natural Language Processing, to search for latent dimensions in the product space of international trade, and their distribution across countries over time. We apply this technique to a dataset of countries' exports of goods from 1962 to 2016. The findings show the possibility to generate higher level classifications of goods based on the empirical evidence, and also allow to study the distribution of those classifications within countries. The latter show interesting insights about countries' trade specialisation. △ Less

Submitted 16 September, 2020; originally announced September 2020.

Journal ref: PLOS ONE (2021) 16(2): e0245393

Showing 1–12 of 12 results for author: Kozlowski, D