ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Holzmann, Helge; Goel, Vinay; Anand, Avishek

doi:10.1145/2910896.2910902

arXiv:1702.01015 (cs)

[Submitted on 3 Feb 2017]

Abstract: Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility and reusability.
Towards these objectives, we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools.

Comments:	JCDL 2016, Newark, NJ, USA
Subjects:	Digital Libraries (cs.DL); Databases (cs.DB)
Cite as:	arXiv:1702.01015 [cs.DL]
	(or arXiv:1702.01015v1 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.1702.01015
Related DOI:	https://doi.org/10.1145/2910896.2910902

Submission history

From: Helge Holzmann
[v1] Fri, 3 Feb 2017 14:17:02 UTC (232 KB)
