Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs > arXiv:1902.02595

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science > Software Engineering

arXiv:1902.02595 (cs)
[Submitted on 7 Feb 2019 (v1), last revised 19 Mar 2019 (this version, v3)]

Title:A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software

Authors:Serena E. Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, Cédric Dangremont
View a PDF of the paper titled A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software, by Serena E. Ponta and 4 other authors
View PDF
Abstract:Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of 624 vulnerabilities, 29 do not have a CVE identifier at all and 46, which do have a CVE identifier assigned by a numbering authority, are not available in the NVD yet. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. Also, these scripts allow to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open-source will allow future research to be based on data of industrial relevance; also, it represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and the industry.
Comments: This is a pre-print version of the paper that appears in the proceedings of The 16th International Conference on Mining Software Repositories (MSR), Data Showcase track
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:1902.02595 [cs.SE]
  (or arXiv:1902.02595v3 [cs.SE] for this version)
  https://doi.org/10.48550/arXiv.1902.02595
arXiv-issued DOI via DataCite
Journal reference: Proceedings of The 16th International Conference on Mining Software Repositories (Data Showcase track), 2019
Related DOI: https://doi.org/10.1109/MSR.2019.0006
DOI(s) linking to related resources

Submission history

From: Antonino Sabetta [view email]
[v1] Thu, 7 Feb 2019 12:47:13 UTC (2,113 KB)
[v2] Fri, 8 Feb 2019 09:06:58 UTC (2,113 KB)
[v3] Tue, 19 Mar 2019 10:33:34 UTC (2,114 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software, by Serena E. Ponta and 4 other authors
  • View PDF
  • TeX Source
  • Other Formats
view license
Current browse context:
cs.LG
< prev   |   next >
new | recent | 2019-02
Change to browse by:
cs
cs.CR
cs.SE

References & Citations

  • NASA ADS
  • Google Scholar
  • Semantic Scholar

DBLP - CS Bibliography

listing | bibtex
Serena Elisa Ponta
Henrik Plate
Antonino Sabetta
Michele Bezzi
Cédric Dangremont
a export BibTeX citation Loading...

BibTeX formatted citation

×
Data provided by:

Bookmark

BibSonomy logo Reddit logo

Bibliographic and Citation Tools

Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)

Code, Data and Media Associated with this Article

alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)

Demos

Replicate (What is Replicate?)
Hugging Face Spaces (What is Spaces?)
TXYZ.AI (What is TXYZ.AI?)

Recommenders and Search Tools

Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
  • Author
  • Venue
  • Institution
  • Topic

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack