-
Status Report of the DPHEP Collaboration: A Global Effort for Sustainable Data Preservation in High Energy Physics
Authors:
DPHEP Collaboration,
Silvia Amerio,
Roberto Barbera,
Frank Berghaus,
Jakob Blomer,
Andrew Branson,
Germán Cancio,
Concetta Cartaro,
Gang Chen,
Sünje Dallmeier-Tiessen,
Cristinel Diaconu,
Gerardo Ganis,
Mihaela Gheata,
Takanori Hara,
Ken Herner,
Mike Hildreth,
Roger Jones,
Stefan Kluth,
Dirk Krücker,
Kati Lassila-Perini,
Marcello Maggi,
Jesus Marco de Lucas,
Salvatore Mele,
Alberto Pace,
Matthias Schröder
, et al. (9 additional authors not shown)
Abstract:
Data from High Energy Physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. An inter-experimental study group on HEP data preservation and long-term analysis was convened as a panel of the International Committee for Future Accelerators (ICFA). The group was formed by large collider-based experiments and investigated the technical and organizati…
▽ More
Data from High Energy Physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. An inter-experimental study group on HEP data preservation and long-term analysis was convened as a panel of the International Committee for Future Accelerators (ICFA). The group was formed by large collider-based experiments and investigated the technical and organizational aspects of HEP data preservation. An intermediate report was released in November 2009 addressing the general issues of data preservation in HEP and an extended blueprint paper was published in 2012. In July 2014 the DPHEP collaboration was formed as a result of the signature of the Collaboration Agreement by seven large funding agencies (others have since joined or are in the process of acquisition) and in June 2015 the first DPHEP Collaboration Workshop and Collaboration Board meeting took place.
This status report of the DPHEP collaboration details the progress during the period from 2013 to 2015 inclusive.
△ Less
Submitted 17 February, 2016; v1 submitted 7 December, 2015;
originally announced December 2015.
-
ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization
Authors:
Ilka Antcheva,
Maarten Ballintijn,
Bertrand Bellenot,
Marek Biskup,
Rene Brun,
Nenad Buncic,
Philippe Canal,
Diego Casadei,
Olivier Couet,
Valery Fine,
Leandro Franco,
Gerardo Ganis,
Andrei Gheata,
David Gonzalez Maline,
Masaharu Goto,
Jan Iwaszkiewicz,
Anna Kreshuk,
Diego Marcos Segura,
Richard Maunder,
Lorenzo Moneta,
Axel Naumann,
Eddy Offermann,
Valeriy Onuchin,
Suzanne Panacek,
Fons Rademakers
, et al. (2 additional authors not shown)
Abstract:
ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical…
▽ More
ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can chose out of a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, ROOT offers packages for complex data modeling and fitting, as well as multivariate classification based on machine learning techniques. A central piece in these analysis tools are the histogram classes which provide binning of one- and multi-dimensional data. Results can be saved in high-quality graphical formats like Postscript and PDF or in bitmap formats like JPG or GIF. The result can also be stored into ROOT macros that allow a full recreation and rework of the graphics. Users typically create their analysis macros step by step, making use of the interactive C++ interpreter CINT, while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone batch program. Finally, if processing farms are available, the user can reduce the execution time of intrinsically parallel tasks - e.g. data mining in HEP - by using PROOF, which will take care of optimally distributing the work over the available resources in a transparent way.
△ Less
Submitted 31 August, 2015;
originally announced August 2015.
-
The Need for a Versioned Data Analysis Software Environment
Authors:
Jakob Blomer,
Dario Berzano,
Predrag Buncic,
Ioannis Charalampidis,
Gerardo Ganis,
George Lestaris,
René Meusel
Abstract:
Scientific results in high-energy physics and in many other fields often rely on complex software stacks. In order to support reproducibility and scrutiny of the results, it is good practice to use open source software and to cite software packages and versions. With ever-growing complexity of scientific software on one side and with IT life-cycles of only a few years on the other side, however, i…
▽ More
Scientific results in high-energy physics and in many other fields often rely on complex software stacks. In order to support reproducibility and scrutiny of the results, it is good practice to use open source software and to cite software packages and versions. With ever-growing complexity of scientific software on one side and with IT life-cycles of only a few years on the other side, however, it turns out that despite source code availability the setup and the validation of a minimal usable analysis environment can easily become prohibitively expensive. We argue that there is a substantial gap between merely having access to versioned source code and the ability to create a data analysis runtime environment. In order to preserve all the different variants of the data analysis runtime environment, we developed a snapshotting file system optimized for software distribution. We report on our experience in preserving the analysis environment for high-energy physics such as the software landscape used to discover the Higgs boson at the Large Hadron Collider.
△ Less
Submitted 11 July, 2014;
originally announced July 2014.
-
CernVM Online and Cloud Gateway: a uniform interface for CernVM contextualization and deployment
Authors:
G. Lestaris,
I. Charalampidis,
D. Berzano,
J. Blomer,
P. Buncic,
G. Ganis,
R. Meusel
Abstract:
In a virtualized environment, contextualization is the process of configuring a VM instance for the needs of various deployment use cases. Contextualization in CernVM can be done by passing a handwritten context to the user data field of cloud APIs, when running CernVM on the cloud, or by using CernVM web interface when running the VM locally. CernVM Online is a publicly accessible web interface t…
▽ More
In a virtualized environment, contextualization is the process of configuring a VM instance for the needs of various deployment use cases. Contextualization in CernVM can be done by passing a handwritten context to the user data field of cloud APIs, when running CernVM on the cloud, or by using CernVM web interface when running the VM locally. CernVM Online is a publicly accessible web interface that unifies these two procedures. A user is able to define, store and share CernVM contexts using CernVM Online and then apply them either in a cloud by using CernVM Cloud Gateway or on a local VM with the single-step pairing mechanism. CernVM Cloud Gateway is a distributed system that provides a single interface to use multiple and different clouds (by location or type, private or public). Cloud gateway has been so far integrated with OpenNebula, CloudStack and EC2 tools interfaces. A user, with access to a number of clouds, can run CernVM cloud agents that will communicate with these clouds using their interfaces, and then use one single interface to deploy and scale CernVM clusters. CernVM clusters are defined in CernVM Online and consist of a set of CernVM instances that are contextualized and can communicate with each other.
△ Less
Submitted 7 April, 2014;
originally announced April 2014.
-
PROOF as a Service on the Cloud: a Virtual Analysis Facility based on the CernVM ecosystem
Authors:
Dario Berzano,
Jakob Blomer,
Predrag Buncic,
Ioannis Charalampidis,
Gerardo Ganis,
Georgios Lestaris,
René Meusel
Abstract:
PROOF, the Parallel ROOT Facility, is a ROOT-based framework which enables interactive parallelism for event-based tasks on a cluster of computing nodes. Although PROOF can be used simply from within a ROOT session with no additional requirements, deploying and configuring a PROOF cluster used to be not as straightforward. Recently great efforts have been spent to make the provisioning of generic…
▽ More
PROOF, the Parallel ROOT Facility, is a ROOT-based framework which enables interactive parallelism for event-based tasks on a cluster of computing nodes. Although PROOF can be used simply from within a ROOT session with no additional requirements, deploying and configuring a PROOF cluster used to be not as straightforward. Recently great efforts have been spent to make the provisioning of generic PROOF analysis facilities with zero configuration, with the added advantages of positively affecting both stability and scalability, making the deployment operations feasible even for the end user. Since a growing amount of large-scale computing resources are nowadays made available by Cloud providers in a virtualized form, we have developed the Virtual PROOF-based Analysis Facility: a cluster appliance combining the solid CernVM ecosystem and PoD (PROOF on Demand), ready to be deployed on the Cloud and leveraging some peculiar Cloud features such as elasticity. We will show how this approach is effective both for sysadmins, who will have little or no configuration to do to run it on their Clouds, and for the end users, who are ultimately in full control of their PROOF cluster and can even easily restart it by themselves in the unfortunate event of a major failure. We will also show how elasticity leads to a more optimal and uniform usage of Cloud resources.
△ Less
Submitted 19 February, 2014;
originally announced February 2014.
-
Micro-CernVM: Slashing the Cost of Building and Deploying Virtual Machines
Authors:
J. Blomer,
D. Berzano,
P. Buncic,
I. Charalampidis,
G. Ganis,
G. Lestaris,
R. Meusel,
V. Nicolaou
Abstract:
The traditional virtual machine building and and deployment process is centered around the virtual machine hard disk image. The packages comprising the VM operating system are carefully selected, hard disk images are built for a variety of different hypervisors, and images have to be distributed and decompressed in order to instantiate a virtual machine. Within the HEP community, the CernVM File S…
▽ More
The traditional virtual machine building and and deployment process is centered around the virtual machine hard disk image. The packages comprising the VM operating system are carefully selected, hard disk images are built for a variety of different hypervisors, and images have to be distributed and decompressed in order to instantiate a virtual machine. Within the HEP community, the CernVM File System has been established in order to decouple the distribution from the experiment software from the building and distribution of the VM hard disk images.
We show how to get rid of such pre-built hard disk images altogether. Due to the high requirements on POSIX compliance imposed by HEP application software, CernVM-FS can also be used to host and boot a Linux operating system. This allows the use of a tiny bootable CD image that comprises only a Linux kernel while the rest of the operating system is provided on demand by CernVM-FS. This approach speeds up the initial instantiation time and reduces virtual machine image sizes by an order of magnitude. Furthermore, security updates can be distributed instantaneously through CernVM-FS. By leveraging the fact that CernVM-FS is a versioning file system, a historic analysis environment can be easily re-spawned by selecting the corresponding CernVM-FS file system snapshot.
△ Less
Submitted 11 November, 2013;
originally announced November 2013.