-
Advancing ATLAS DCS Data Analysis with a Modern Data Platform
Authors:
Luca Canali,
Andrea Formica,
Michelle Ann Solis
Abstract:
This paper presents a modern and scalable framework for analyzing Detector Control System (DCS) data from the ATLAS experiment at CERN. The DCS data, stored in an Oracle database via the WinCC OA system, is optimized for transactional operations, posing challenges for large-scale analysis across extensive time periods and devices. To address these limitations, we developed a data pipeline using Ap…
▽ More
This paper presents a modern and scalable framework for analyzing Detector Control System (DCS) data from the ATLAS experiment at CERN. The DCS data, stored in an Oracle database via the WinCC OA system, is optimized for transactional operations, posing challenges for large-scale analysis across extensive time periods and devices. To address these limitations, we developed a data pipeline using Apache Spark, CERN's Hadoop service, and the CERN SWAN platform. This framework integrates seamlessly with Python notebooks, providing an accessible and efficient environment for data analysis using industry-standard tools. The approach has proven effective in troubleshooting Data Acquisition (DAQ) links for the ATLAS New Small Wheel (NSW) detector, demonstrating the value of modern data platforms in enabling detector experts to quickly identify and resolve critical issues.
△ Less
Submitted 23 January, 2025;
originally announced January 2025.
-
The ATLAS EventIndex: a BigData catalogue for all ATLAS experiment events
Authors:
Dario Barberis,
Igor Aleksandrov,
Evgeny Alexandrov,
Zbigniew Baranowski,
Luca Canali,
Elizaveta Cherepanova,
Gancho Dimitrov,
Andrea Favareto,
Alvaro Fernandez Casani,
Elizabeth J. Gallas,
Carlos Garcia Montoro,
Santiago Gonzalez de la Hoz,
Julius Hrivnac,
Alexander Iakovlev,
Andrei Kazymov,
Mikhail Mineev,
Fedor Prokoshin,
Grigori Rybkin,
Jose Salt,
Javier Sanchez,
Roman Sorokoletov,
Rainer Toebbicke,
Petya Vasileva,
Miguel Villaplana Perez,
Ruijun Yuan
Abstract:
The ATLAS EventIndex system comprises the catalogue of all events collected, processed or generated by the ATLAS experiment at the CERN LHC accelerator, and all associated software tools to collect, store and query this information. ATLAS records several billion particle interactions every year of operation, processes them for analysis and generates even larger simulated data samples; a global cat…
▽ More
The ATLAS EventIndex system comprises the catalogue of all events collected, processed or generated by the ATLAS experiment at the CERN LHC accelerator, and all associated software tools to collect, store and query this information. ATLAS records several billion particle interactions every year of operation, processes them for analysis and generates even larger simulated data samples; a global catalogue is needed to keep track of the location of each event record and be able to search and retrieve specific events for in-depth investigations. Each EventIndex record includes summary information on the event itself and the pointers to the files containing the full event. Most components of the EventIndex system are implemented using BigData open-source tools. This paper describes the architectural choices and their evolution in time, as well as the past, current and foreseen future implementations of all EventIndex components.
△ Less
Submitted 12 March, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
Authors:
Matteo Migliorini,
Riccardo Castellotti,
Luca Canali,
Marco Zanetti
Abstract:
The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tool…
▽ More
The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is exploited for data preparation and feature engineering, running the corresponding (Python) code interactively on Jupyter notebooks. Key integrations and libraries that make Spark capable of ingesting data stored using ROOT format and accessed via the XRootD protocol, are described and discussed. Training of the neural network models, defined using the Keras API, is performed in a distributed fashion on Spark clusters by using BigDL with Analytics Zoo and also by using TensorFlow, notably for distributed training on CPU and GPU resourcess. The implementation and the results of the distributed training are described in detail in this work.
△ Less
Submitted 16 June, 2020; v1 submitted 23 September, 2019;
originally announced September 2019.
-
Using Big Data Technologies for HEP Analysis
Authors:
Matteo Cremonesi,
Claudio Bellini,
Bianny Bian,
Luca Canali,
Vasileios Dimakopoulos,
Peter Elmer,
Ian Fisk,
Maria Girone,
Oliver Gutsche,
Siew-Yan Hoh,
Bo Jayatilaka,
Viktor Khristenko,
Andrea Luiselli,
Andrew Melo,
Evangelos Evangelos,
Dominick Olivito,
Jacopo Pazzini,
Jim Pivarski,
Alexey Svyatkovskiy,
Marco Zanetti
Abstract:
The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results timely and efficiently. Recently, new technologies and new approaches…
▽ More
The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results timely and efficiently. Recently, new technologies and new approaches have been developed in industry to answer to the necessity to retrieve information as quickly as possible to analyze PB and EB datasets. Providing the scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother.
In this paper, we are presenting the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the application of the new tools both quantitatively, by measuring the performances, and qualitatively, focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data in 5 hours, collected by the CMS experiment, to 1 TB of data in a format suitable for physics analysis.
The second goal is achieved by implementing multiple physics use-cases in Apache Spark using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to the ones of a traditional ROOT-based workflow.
△ Less
Submitted 21 January, 2019;
originally announced January 2019.
-
CMS Analysis and Data Reduction with Apache Spark
Authors:
Oliver Gutsche,
Luca Canali,
Illia Cremer,
Matteo Cremonesi,
Peter Elmer,
Ian Fisk,
Maria Girone,
Bo Jayatilaka,
Jim Kowalkowski,
Viktor Khristenko,
Evangelos Motesnitsalis,
Jim Pivarski,
Saba Sehrish,
Kacper Surdy,
Alexey Svyatkovskiy
Abstract:
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies have emerged from industry and open source projects to support the a…
▽ More
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover these new tools are typically actively developed by large communities, often profiting of industry resources, and under open source licensing. These factors result in a boost for adoption and maturity of the tools and for the communities supporting them, at the same time helping in reducing the cost of ownership for the end-users. In this talk, we are presenting studies of using Apache Spark for end user data analysis. We are studying the HEP analysis workflow separated into two thrusts: the reduction of centrally produced experiment datasets and the end-analysis up to the publication plot. Studying the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility. The goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis. We are presenting the progress of this 2-year project with first results of scaling up Spark-based HEP analysis. Studying the second thrust, we are presenting studies on using Apache Spark for a CMS Dark Matter physics search, comparing Spark's feasibility, usability and performance to the ROOT-based analysis.
△ Less
Submitted 31 October, 2017;
originally announced November 2017.