-
Scientific Open-Source Software Is Less Likely to Become Abandoned Than One Might Think! Lessons from Curating a Catalog of Maintained Scientific Software
Authors:
Addi Malviya Thakur,
Reed Milewicz,
Mahmoud Jahanshahi,
LavĂnia Paganini,
Bogdan Vasilescu,
Audris Mockus
Abstract:
Scientific software is essential to scientific innovation and in many ways it is distinct from other types of software. Abandoned (or unmaintained), buggy, and hard to use software, a perception often associated with scientific software can hinder scientific progress, yet, in contrast to other types of software, its longevity is poorly understood. Existing data curation efforts are fragmented by s…
▽ More
Scientific software is essential to scientific innovation and in many ways it is distinct from other types of software. Abandoned (or unmaintained), buggy, and hard to use software, a perception often associated with scientific software can hinder scientific progress, yet, in contrast to other types of software, its longevity is poorly understood. Existing data curation efforts are fragmented by science domain and/or are small in scale and lack key attributes. We use large language models to classify public software repositories in World of Code into distinct scientific domains and layers of the software stack, curating a large and diverse collection of over 18,000 scientific software projects. Using this data, we estimate survival models to understand how the domain, infrastructural layer, and other attributes of scientific software affect its longevity. We further obtain a matched sample of non-scientific software repositories and investigate the differences. We find that infrastructural layers, downstream dependencies, mentions of publications, and participants from government are associated with a longer lifespan, while newer projects with participants from academia had shorter lifespan. Against common expectations, scientific projects have a longer lifetime than matched non-scientific open-source software projects. We expect our curated attribute-rich collection to support future research on scientific software and provide insights that may help extend longevity of both scientific and other projects.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets
Authors:
Mahmoud Jahanshahi,
Audris Mockus
Abstract:
A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs…
▽ More
A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention.
We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach leverages the version history of all OSS projects to identify training data samples that have been modified or have undergone changes in at least one OSS project, and pinpoint a subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns.
By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.
△ Less
Submitted 5 January, 2025;
originally announced January 2025.
-
Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development
Authors:
Mahmoud Jahanshahi,
David Reid,
Audris Mockus
Abstract:
In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. In contrast to dependency-based reuse, the infrastructure to systematically support copy-based reuse appears to be entirely missing. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse. We seek a bette…
▽ More
In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. In contrast to dependency-based reuse, the infrastructure to systematically support copy-based reuse appears to be entirely missing. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse. We seek a better understanding of such reuse by measuring its prevalence and identifying factors affecting the propensity to reuse. To identify reused artifacts and trace their origins, our method exploits World of Code infrastructure. We begin with a set of theory-derived factors related to the propensity to reuse, sample instances of different reuse types, and survey developers to better understand their intentions. Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, consistently decreasing over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from ``small'' and ``medium'' projects. Developers had various reasons for reuse but were generally positive about using a package manager.
△ Less
Submitted 7 September, 2024;
originally announced September 2024.
-
OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
Authors:
Mahmoud Jahanshahi,
David Reid,
Adam McDaniel,
Audris Mockus
Abstract:
The proliferation of open source software (OSS) and different types of reuse has made it incredibly difficult to perform an essential legal and compliance task of accurate license identification within the software supply chain. This study presents a reusable and comprehensive dataset of OSS licenses, created using the World of Code (WoC) infrastructure. By scanning all files containing "license"…
▽ More
The proliferation of open source software (OSS) and different types of reuse has made it incredibly difficult to perform an essential legal and compliance task of accurate license identification within the software supply chain. This study presents a reusable and comprehensive dataset of OSS licenses, created using the World of Code (WoC) infrastructure. By scanning all files containing "license" in their file paths, and applying the approximate matching via winnowing algorithm to identify the most similar license from the SPDX list, we found and identified 5.5 million distinct license blobs in OSS projects. The dataset includes a detailed project-to-license (P2L) map with commit timestamps, enabling dynamic analysis of license adoption and changes over time. To verify the accuracy of the dataset we use stratified sampling and manual review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall of 95.45%, and an F1 score of 91.11%. This dataset is intended to support a range of research and practical tasks, including the detection of license noncompliance, the investigations of license changes, study of licensing trends, and the development of compliance tools. The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
△ Less
Submitted 11 March, 2025; v1 submitted 7 September, 2024;
originally announced September 2024.
-
Dataset: Copy-based Reuse in Open Source Software
Authors:
Mahmoud Jahanshahi,
Audris Mockus
Abstract:
In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions. In contrast to some studies of dependency-based reuse supported via package managers, no studies of OSS-wide copy-based reuse exist. This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying…
▽ More
In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions. In contrast to some studies of dependency-based reuse supported via package managers, no studies of OSS-wide copy-based reuse exist. This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS. To accomplish that, we develop approaches to detect copy-based reuse by developing an efficient algorithm that exploits World of Code infrastructure: a curated and cross referenced collection of nearly all open source repositories. We expect this data to enable future research and tool development that support such reuse and minimize associated risks.
△ Less
Submitted 14 December, 2023;
originally announced December 2023.
-
Farspredict: A benchmark dataset for link prediction
Authors:
Najmeh Torabian,
Behrouz Minaei-Bidgoli,
Mohsen Jahanshahi
Abstract:
Link prediction with knowledge graph embedding (KGE) is a popular method for knowledge graph completion. Furthermore, training KGEs on non-English knowledge graph promote knowledge extraction and knowledge graph reasoning in the context of these languages. However, many challenges in non-English KGEs pose to learning a low-dimensional representation of a knowledge graph's entities and relations. T…
▽ More
Link prediction with knowledge graph embedding (KGE) is a popular method for knowledge graph completion. Furthermore, training KGEs on non-English knowledge graph promote knowledge extraction and knowledge graph reasoning in the context of these languages. However, many challenges in non-English KGEs pose to learning a low-dimensional representation of a knowledge graph's entities and relations. This paper proposes "Farspredict" a Persian knowledge graph based on Farsbase (the most comprehensive knowledge graph in Persian). It also explains how the knowledge graph structure affects link prediction accuracy in KGE. To evaluate Farspredict, we implemented the popular models of KGE on it and compared the results with Freebase. Given the analysis results, some optimizations on the knowledge graph are carried out to improve its functionality in the KGE. As a result, a new Persian knowledge graph is achieved. Implementation results in the KGE models on Farspredict outperforming Freebases in many cases. At last, we discuss what improvements could be effective in enhancing the quality of Farspredict and how much it improves.
△ Less
Submitted 26 March, 2023;
originally announced March 2023.
-
Building the Collaboration Graph of Open-Source Software Ecosystem
Authors:
Elena Lyulina,
Mahmoud Jahanshahi
Abstract:
The Open-Source Software community has become the center of attention for many researchers, who are investigating various aspects of collaboration in this extremely large ecosystem. Due to its size, it is difficult to grasp whether or not it has structure, and if so, what it may be. Our hackathon project aims to facilitate the understanding of the developer collaboration structure and relationship…
▽ More
The Open-Source Software community has become the center of attention for many researchers, who are investigating various aspects of collaboration in this extremely large ecosystem. Due to its size, it is difficult to grasp whether or not it has structure, and if so, what it may be. Our hackathon project aims to facilitate the understanding of the developer collaboration structure and relationships among projects based on the bi-graph of what projects developers contribute to by providing an interactive collaboration graph of this ecosystem, using the data obtained from World of Code infrastructure. Our attempts to visualize the entirety of projects and developers were stymied by the inability of the layout and visualization tools to process the exceedingly large scale of the full graph. We used WoC to filter the nodes (developers and projects) and edges (developer contributions to a project) to reduce the scale of the graph that made it amenable to an interactive visualization and published the resulting visualizations. We plan to apply hierarchical approaches to be able to incorporate the entire data in the interactive visualizations and also to evaluate the utility of such visualizations for several tasks.
△ Less
Submitted 22 March, 2021;
originally announced March 2021.
-
Gateway Placement and Selection Solutions in WMNs: A Survey
Authors:
Mohsen Jahanshahi,
Arash Bozorgchenani
Abstract:
Due to the high demand of Internet access by users, and the tremendous success of wireless technologies, Wireless Mesh Networks (WMNs) have become a promising solution. IGW Placement and Selection (GPS) are significantly investigated problems to achieve QoS requirements, network performance, and reduce deployment cost in WMNs. Best effort is made to classify different works in the literature based…
▽ More
Due to the high demand of Internet access by users, and the tremendous success of wireless technologies, Wireless Mesh Networks (WMNs) have become a promising solution. IGW Placement and Selection (GPS) are significantly investigated problems to achieve QoS requirements, network performance, and reduce deployment cost in WMNs. Best effort is made to classify different works in the literature based on network characteristics. At first, one of the most principal capabilities of WMNs, which is taking advantage of using multi-radio routers in a multi-channel network, is studied. In this article, GPS protocols considering a definition of three types of WMN are investigated based on channel-radio association including Single Radio Single Channel (SRSC), Single Radio Multi- Channel (SRMC), and Multi-Radio Multi-Channel (MRMC) WMNs. Furthermore, a classification regarding static and dynamic channel allocation policies is derived. In addition, the reported works from the viewpoint of network solutions are classified. The first perspective is: centralized, distributed or hybrid architectures. Following this classification, the studies are categorized regarding optimization techniques, which are operation research-based solutions, heuristic algorithms, and meta-heuristic-based algorithms.
△ Less
Submitted 16 June, 2019;
originally announced June 2019.