Search | arXiv e-print repository

Characterizing Knowledge Manipulation in a Russian Wikipedia Fork

Authors: Mykola Trokhymovych, Oleksandr Kosovan, Nathan Forrester, Pablo Aragón, Diego Saez-Trumper, Ricardo Baeza-Yates

Abstract: Wikipedia is powered by MediaWiki, a free and open-source software that is also the infrastructure for many other wiki-based online encyclopedias. These include the recently launched website Ruwiki, which has copied and modified the original Russian Wikipedia content to conform to Russian law. To identify practices and narratives that could be associated with different forms of knowledge manipulation, this article presents an in-depth analysis of this Russian Wikipedia fork. We propose a methodology to characterize the main changes with respect to the original version. The foundation of this study is a comprehensive comparative analysis of more than 1.9M articles from Russian Wikipedia and its fork. Using meta-information and geographical, temporal, categorical, and textual features, we explore the changes made by Ruwiki editors. Furthermore, we present a classification of the main topics of knowledge manipulation in this fork, including a numerical estimation of their scope. This research not only sheds light on significant changes within Ruwiki, but also provides a methodology that could be applied to analyze other Wikipedia forks and similar collaborative projects.

Submitted 21 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

arXiv:2410.18803

Language-Agnostic Modeling of Source Reliability on Wikipedia

Authors: Jacopo D'Ignazi, Andreas Kaltenbrunner, Yelena Mejova, Michele Tizzani, Kyriaki Kalimeri, Mariano Beiró, Pablo Aragón

Abstract: Over the last few years, content verification through reliable sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Utilizing editorial activity data, the model evaluates source reliability within different articles of varying controversiality such as Climate Change, COVID-19, History, Media, and Biology topics. Crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65 while the performance of low-resource languages varies; in all cases, the time the domain remains present in the articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia's efforts in ensuring content verifiability but also to ensuring reliability across diverse user-generated content in various language communities.

Submitted 14 January, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

arXiv:2404.09764

Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages

Authors: Paramita Das, Isaac Johnson, Diego Saez-Trumper, Pablo Aragón

Abstract: Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to model Wikipedia article quality have leveraged machine learning techniques with language-specific features. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language version-specific normalization criterion. Therefore, we ensure that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. In addition, we discuss possible downstream tasks to be addressed with these datasets, which are released for public use.

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted at ICWSM-24

arXiv:2309.00196

doi 10.1145/3583780.3615254

A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia

Authors: Aitolkyn Baigutanova, Diego Saez-Trumper, Miriam Redi, Meeyoung Cha, Pablo Aragón

Abstract: Information presented in Wikipedia articles must be attributable to reliable published sources in the form of references. This study examines over 5 million Wikipedia articles to assess the reliability of references in multiple language editions. We quantify the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and collaboratively agreed upon by Wikipedia editors. We discover that some sources (or web domains) deemed untrustworthy in one language (i.e., English) continue to appear in articles in other languages. This trend is especially evident with sources tailored for smaller communities. Furthermore, non-authoritative sources found in the English version of a page tend to persist in other language versions of that page. We finally present a case study on the Chinese, Russian, and Swedish Wikipedias to demonstrate a discrepancy in reference reliability across cultures. Our finding highlights future challenges in coordinating global knowledge on source reliability.

Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced September 2023.

Comments: Conference on Information & Knowledge Management (CIKM '23)

arXiv:2106.15940

A preliminary approach to knowledge integrity risk assessment in Wikipedia projects

Authors: Pablo Aragón, Diego Sáez-Trumper

Abstract: Wikipedia is one of the main repositories of free knowledge available today, with a central role in the Web ecosystem. For this reason, it can also be a battleground for actors trying to impose specific points of view or even spreading disinformation online. There is a growing need to monitor its "health" but this is not an easy task. Wikipedia exists in over 300 language editions and each project is maintained by a different community, with their own strengths, weaknesses and limitations. In this paper, we introduce a taxonomy of knowledge integrity risks across Wikipedia projects and a first set of indicators to assess internal risks related to community and content issues, as well as external threats such as the geopolitical and media landscape. On top of this taxonomy, we offer a preliminary analysis illustrating how the lack of editors' geographical diversity might represent a knowledge integrity risk. These are the first steps of a research project to build a Wikipedia Knowledge Integrity Risk Observatory.

Submitted 30 June, 2021; originally announced June 2021.

Comments: Accepted at MIS2'21: Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with KDD 2021

arXiv:2012.00515

Civic Technologies: Research, Practice and Open Challenges

Authors: Pablo Aragon, Adriana Alvarado Garcia, Christopher A. Le Dantec, Claudia Flores-Saviaga, Jorge Saldivar

Abstract: Over the last few years, civic technology projects have emerged around the world to advance open government and community action. Although the Computer-Supported Cooperative Work (CSCW) and Human-Computer Interaction (HCI) communities have shown a growing interest in researching issues around civic technologies, most research still focuses on projects from the Global North. The goal of this workshop is, therefore, to advance CSCW research by raising awareness of the ongoing challenges and open questions around civic technology by bridging the gap between researchers and practitioners from different regions. The workshop will be organized around three central topics: (1) discuss how the local context and infrastructure affect the design, implementation, adoption, and maintenance of civic technology; (2) identify key elements of the configuration of trust among government, citizenry, and local organizations and how these elements change depending on the sociopolitical context where community engagement takes place; (3) discover what methods and strategies are best suited for conducting research on civic technologies in different contexts. These core topics will be covered across sessions that will initiate in-depth discussions and, thereby, stimulate collaboration between the CSCW research community and practitioners of civic technologies from both Global North and South.

Submitted 1 December, 2020; originally announced December 2020.

Comments: Proposal, outcome and position papers of the 23rd ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2020) workshop "Civic Technologies: Research, Practice, and Open Challenges"

arXiv:1807.04448

Interactive Discovery System for Direct Democracy

Authors: Pablo Aragón, Yago Bermejo, Vicenç Gómez, Andreas Kaltenbrunner

Abstract: Decide Madrid is the civic technology of Madrid City Council which allows users to create and support online petitions. Despite the initial success, the platform is encountering problems with the growth of petition signing because petitions are far from the minimum number of supporting votes they must gather. Previous analyses have suggested that this problem is produced by the interface: a paginated list of petitions which applies a non-optimal ranking algorithm. For this reason, we present an interactive system for the discovery of topics and petitions. This approach leads us to reflect on the usefulness of data visualization techniques to address relevant societal challenges.

Submitted 12 July, 2018; originally announced July 2018.

Comments: Accepted at the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 18). For academic purposes, please cite the conference version

arXiv:1806.08282

Online Petitioning Through Data Exploration and What We Found There: A Dataset of Petitions from Avaaz.org

Authors: Pablo Aragón, Diego Sáez-Trumper, Miriam Redi, Scott A. Hale, Vicenç Gómez, Andreas Kaltenbrunner

Abstract: The Internet has become a fundamental resource for activism as it facilitates political mobilization at a global scale. Petition platforms are a clear example of how thousands of people around the world can contribute to social change. Avaaz.org, with a presence in over 200 countries, is one of the most popular of this type. However, little research has focused on this platform, probably due to a lack of available data. In this work we retrieved more than 350K petitions, standardized their field values, and added new information using language detection and named-entity recognition. To motivate future research with this unique repository of global protest, we present a first exploration of the dataset. In particular, we examine how social media campaigning is related to the success of petitions, as well as some geographic and linguistic findings about the worldwide community of Avaaz.org. We conclude with example research questions that could be addressed with our dataset.

Submitted 21 June, 2018; originally announced June 2018.

Comments: Accepted as a dataset paper at the 12th International AAAI Conference on Web and Social Media (ICWSM-18). This preprint includes an additional appendix with the reasons, provided by Avaaz.org, about the anomalies detected when exploring the dataset. For academic purposes, please cite the ICWSM version

arXiv:1707.06526

Deliberative Platform Design: The case study of the online discussions in Decidim Barcelona

Authors: Pablo Aragón, Andreas Kaltenbrunner, Antonio Calleja-López, Andrés Pereira, Arnau Monterde, Xabier E. Barandiaran, Vicenç Gómez

Abstract: With the irruption of ICTs and the crisis of political representation, many online platforms have been developed with the aim of improving participatory democratic processes. However, regarding platforms for online petitioning, previous research has not found examples of how to effectively introduce discussions, a crucial feature to promote deliberation. In this study we focus on the case of Decidim Barcelona, the online participatory-democracy platform launched by the City Council of Barcelona in which proposals can be discussed with an interface that combines threaded discussions and comment alignment with the proposal. This innovative approach allows us to examine whether neutral, positive or negative comments are more likely to generate discussion cascades. The results reveal that, with this interface, comments marked as negatively aligned with the proposal were more likely to engage users in online discussions and, therefore, helped to promote deliberative decision making.

Submitted 20 July, 2017; originally announced July 2017.

Comments: Accepted at the 9th International Conference on Social Informatics (SocInfo 2017)

arXiv:1507.08599

When a Movement Becomes a Party: The 2015 Barcelona City Council Election

Authors: Pablo Aragón, Yana Volkovich, David Laniado, Andreas Kaltenbrunner

Abstract: Barcelona en Comú, an emerging grassroots movement-party, won the 2015 Barcelona City Council election. This candidacy was devised by activists involved in the 15M movement in order to turn citizen outrage into political change. On the one hand, the 15M movement is based on a decentralized structure. On the other hand, political science literature postulates that parties historically develop oligarchical leadership structures. This tension motivates us to examine whether Barcelona en Comú preserved a decentralized structure or adopted a conventional centralized organization. In this article we analyse the Twitter networks of the parties that ran for this election by measuring their hierarchical structure, information efficiency and social resilience. Our results show that in Barcelona en Comú two well-defined groups co-exist: a cluster dominated by the leader and the collective accounts, and another cluster formed by the movement activists. While the former group is highly centralized like the other major parties, the latter one stands out for its decentralized, cohesive and resilient structure.

Submitted 30 July, 2015; originally announced July 2015.

arXiv:1405.7183

doi 10.1371/journal.pone.0114825

Interactions of cultures and top people of Wikipedia from ranking of 24 language editions

Authors: Young-Ho Eom, Pablo Aragón, David Laniado, Andreas Kaltenbrunner, Sebastiano Vigna, Dima L. Shepelyansky

Abstract: Wikipedia is a huge global repository of human knowledge, which can be leveraged to investigate entanglements between cultures. With this aim, we apply methods of Markov chains and the Google matrix to the analysis of the hyperlink networks of 24 Wikipedia language editions, and rank all their articles by the PageRank, 2DRank and CheiRank algorithms. Using automatic extraction of people's names, we obtain the top 100 historical figures for each edition and for each algorithm. We investigate their spatial, temporal, and gender distributions as a function of their cultural origins. Our study demonstrates not only the existence of skewness with local figures, mainly recognized only in their own cultures, but also the existence of global historical figures appearing in a large number of editions. By determining the birth time and place of these persons, we perform an analysis of the evolution of such figures through 35 centuries of human history for each language, thus recovering interactions and entanglement of cultures over time. We also obtain the distributions of historical figures over world countries, highlighting geographical aspects of cross-cultural links. Considering historical figures who appear in multiple editions as interactions between cultures, we construct a network of cultures and identify the most influential cultures according to this network.

Submitted 17 November, 2014; v1 submitted 28 May, 2014; originally announced May 2014.

Comments: 32 pages. 10 figures. Submitted for publication. Supporting information is available on http://www.quantware.ups-tlse.fr/QWLIB/topwikipeople/

Journal ref: PLoS ONE 10(3): e0114825 (2015)

arXiv:1301.6900

Not all paths lead to Rome: Analysing the network of sister cities

Authors: Andreas Kaltenbrunner, Pablo Aragón, David Laniado, Yana Volkovich

Abstract: This work analyses the practice of sister city pairing. We investigate structural properties of the resulting city and country networks and present rankings of the most central nodes in these networks. We identify different country clusters and find that the practice of sister city pairing is not influenced by geographical proximity but results in highly assortative networks.

Submitted 29 January, 2013; originally announced January 2013.

Comments: 7 pages, 4 figures

arXiv:1210.6883

doi 10.1371/journal.pone.0060584

Jointly they edit: examining the impact of community identification on political interaction in Wikipedia

Authors: Jessica G. Neff, David Laniado, Karolin Kappler, Yana Volkovich, Pablo Aragón, Andreas Kaltenbrunner

Abstract: In their 2005 study, Adamic and Glance coined the memorable phrase "divided they blog", referring to a trend of cyberbalkanization in the political blogosphere, with liberal and conservative blogs tending to link to other blogs with a similar political slant, and not to one another. As political discussion and activity increasingly moves online, the power of framing political discourses is shifting from mass media to social media. Continued examination of political interactions online is critical, and we extend this line of research by examining the activities of political users within the Wikipedia community. First, we examined how users in Wikipedia choose to display (or not to display) their political affiliation. Next, we more closely examined the patterns of cross-party interaction and community participation among those users proclaiming a political affiliation. In contrast to previous analyses of other social media, we did not find strong trends indicating a preference to interact with members of the same political party within the Wikipedia community. Our results indicate that users who proclaim their political affiliation within the community tend to proclaim their identity as a "Wikipedian" even more loudly. It seems that the shared identity of "being Wikipedian" may be strong enough to triumph over other potentially divisive facets of personal identity, such as political affiliation.

Submitted 5 November, 2012; v1 submitted 25 October, 2012; originally announced October 2012.

Comments: 33 pages, 5 figures

Journal ref: PLoS ONE 8(4): e60584 (2013)

arXiv:1204.3799

Biographical Social Networks on Wikipedia - A cross-cultural study of links that made history

Authors: Pablo Aragón, Andreas Kaltenbrunner, David Laniado, Yana Volkovich

Abstract: It is arguable whether history is made by great men and women or vice versa, but undoubtedly social connections shape history. Analysing Wikipedia, a global collective memory place, we aim to understand how social links are recorded across cultures. Starting with the set of biographies in the English Wikipedia we focus on the networks of links between these biographical articles on the 15 largest language Wikipedias. We detect the most central characters in these networks and point out culture-related peculiarities. Furthermore, we reveal remarkable similarities between distinct groups of language Wikipedias and highlight the shared knowledge about connections between persons across cultures.

Submitted 4 July, 2012; v1 submitted 17 April, 2012; originally announced April 2012.

Comments: 4 pages, 3 figures

ACM Class: J.4; G.2.2

Journal ref: Proceedings of WikiSym, 2012

Showing 1–14 of 14 results for author: Aragón, P