-
ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization
Authors:
Ilka Antcheva,
Maarten Ballintijn,
Bertrand Bellenot,
Marek Biskup,
Rene Brun,
Nenad Buncic,
Philippe Canal,
Diego Casadei,
Olivier Couet,
Valery Fine,
Leandro Franco,
Gerardo Ganis,
Andrei Gheata,
David Gonzalez Maline,
Masaharu Goto,
Jan Iwaszkiewicz,
Anna Kreshuk,
Diego Marcos Segura,
Richard Maunder,
Lorenzo Moneta,
Axel Naumann,
Eddy Offermann,
Valeriy Onuchin,
Suzanne Panacek,
Fons Rademakers
, et al. (2 additional authors not shown)
Abstract:
ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, ROOT offers packages for complex data modeling and fitting, as well as multivariate classification based on machine learning techniques. Central to these analysis tools are the histogram classes, which provide binning of one- and multi-dimensional data. Results can be saved in high-quality graphical formats like PostScript and PDF or in bitmap formats like JPG or GIF. The result can also be stored into ROOT macros that allow a full recreation and rework of the graphics. Users typically create their analysis macros step by step, making use of the interactive C++ interpreter CINT, while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone batch program. Finally, if processing farms are available, the user can reduce the execution time of intrinsically parallel tasks - e.g. data mining in HEP - by using PROOF, which will take care of optimally distributing the work over the available resources in a transparent way.
Submitted 31 August, 2015;
originally announced August 2015.
-
Objective Bayesian analysis of counting experiments with correlated sources of background
Authors:
Diego Casadei,
Cornelius Grunwald,
Kevin Kröninger,
Florian Mentzel
Abstract:
Searches for faint signals in counting experiments are often encountered in particle physics and astrophysics, as well as in other fields. Many problems can be reduced to the case of a model with independent and Poisson-distributed signal and background. Often several background contributions are present at the same time, possibly correlated. We provide the analytic solution of the statistical inference problem of estimating the signal in the presence of multiple backgrounds, in the framework of objective Bayes statistics. The model can be written in the form of a product of a single Poisson distribution with a multinomial distribution. The first is related to the total number of events, whereas the latter describes the fraction of events coming from each individual source. Correlations among different backgrounds can be included in the inference problem by a suitable choice of the priors.
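The Poisson-times-multinomial form mentioned in the abstract follows from a standard identity, written out here as a sketch. The notation is an assumption of this note, not taken from the paper: $n_i$ are the counts from each of $m$ independent sources with Poisson means $\lambda_i$, $n = \sum_i n_i$ is the total count, and $\lambda = \sum_i \lambda_i$.

```latex
\prod_{i=1}^{m} \frac{\lambda_i^{n_i}\, e^{-\lambda_i}}{n_i!}
  \;=\;
  \underbrace{\frac{\lambda^{n}\, e^{-\lambda}}{n!}}_{\text{Poisson for the total } n}
  \;\times\;
  \underbrace{\frac{n!}{\prod_{i=1}^{m} n_i!}
     \prod_{i=1}^{m} \left(\frac{\lambda_i}{\lambda}\right)^{n_i}}_{\text{multinomial for the fractions}}
```

The first factor depends only on the total number of events, while the second depends only on the fractions $\lambda_i/\lambda$ contributed by each source, which matches the split described in the abstract.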
Submitted 19 January, 2017; v1 submitted 10 April, 2015;
originally announced April 2015.
-
Reference analysis of the signal + background model in counting experiments II. Approximate reference prior
Authors:
Diego Casadei
Abstract:
The objective Bayesian treatment of a model representing two independent Poisson processes, labelled as "signal" and "background" and both contributing additively to the total number of counted events, is considered. It is shown that the reference prior for the parameter of interest (the signal intensity) can be well approximated by the widely (ab)used flat prior only when the expected background is very high. On the other hand, a very simple approximation (the limiting form of the reference prior for perfect prior background knowledge) can be safely used over a large portion of the background parameter space. The resulting approximate reference posterior is a Gamma density whose parameters are related to the observed counts. This limiting form is simpler than the result obtained with a flat prior, with the additional advantage of representing a much closer approximation to the reference posterior in all cases. Hence this limiting prior should be considered a better default or conventional prior than the uniform prior. On the computing side, it is shown that a 2-parameter fitting function is able to reproduce extremely well the reference prior for any background prior. Thus, it can be useful in applications requiring the evaluation of the reference prior a very large number of times. [The published version JINST 9 (2014) T10006 has a typo in the normalization $N$ of eq.(2.6) that is fixed here.]
Submitted 2 March, 2015; v1 submitted 22 July, 2014;
originally announced July 2014.
-
Plotting the Differences Between Data and Expectation
Authors:
Georgios Choudalakis,
Diego Casadei
Abstract:
This article proposes a way to improve the presentation of histograms where data are compared to expectation. Sometimes, it is difficult to judge by eye whether the difference between the bin content and the theoretical expectation (provided by either a fitting function or another histogram) is just due to statistical fluctuations. More importantly, there could be statistically significant deviations which are completely invisible in the plot. We propose to add a small inset at the bottom of the plot, in which the statistical significance of the deviation observed in each bin is shown. Even though the numerical routines we developed serve only for illustration, it turns out that they are based on formulae which could be used to perform statistical inference in a proper way. An implementation of our computation is available at https://github.com/dcasadei/psde .
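One way to picture a per-bin significance is the minimal sketch below. It is not the authors' implementation (that one lives at the repository linked above); it assumes each bin is an independent Poisson count with a known expectation, computes the one-sided tail probability in the direction of the deviation, and converts it to a signed Gaussian z-value. The function names and the choice to report 0 when the tail probability is 0.5 or larger are assumptions of this sketch.

```python
import math
from statistics import NormalDist


def poisson_pmf(k: int, mu: float) -> float:
    """P(N = k) for N ~ Poisson(mu), evaluated in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def poisson_cdf(k: int, mu: float) -> float:
    """P(N <= k) for N ~ Poisson(mu); returns 0 for negative k."""
    if k < 0:
        return 0.0
    return min(1.0, sum(poisson_pmf(i, mu) for i in range(k + 1)))


def bin_significance(observed: int, expected: float) -> float:
    """Signed z-value of the deviation in one bin (hypothetical helper).

    Positive for an excess, negative for a deficit, and 0.0 when the
    one-sided tail probability is 0.5 or larger (assumption of this sketch).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed >= expected:
        # Excess: probability of counting at least `observed` events.
        p_value = 1.0 - poisson_cdf(observed - 1, expected)
        sign = 1.0
    else:
        # Deficit: probability of counting at most `observed` events.
        p_value = poisson_cdf(observed, expected)
        sign = -1.0
    if p_value >= 0.5:
        return 0.0
    # Clamp to avoid a domain error when the tail underflows; the z-value
    # is then only a rough lower bound on the true significance.
    p_value = max(p_value, 1e-300)
    return -sign * NormalDist().inv_cdf(p_value)


if __name__ == "__main__":
    for obs, exp in [(10, 10.0), (25, 10.0), (2, 10.0)]:
        print(obs, exp, round(bin_significance(obs, exp), 2))
```

The values returned for each bin could then be drawn as a bar chart in the inset below the main histogram.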
Submitted 20 February, 2018; v1 submitted 8 November, 2011;
originally announced November 2011.
-
Reference analysis of the signal + background model in counting experiments
Authors:
Diego Casadei
Abstract:
The model representing two independent Poisson processes, labelled as "signal" and "background" and both contributing at the same time to the total number of counted events, is considered from a Bayesian point of view. This is a widely used model for the searches for rare or exotic events in the presence of some background source, for example in the searches performed by the high-energy physics experiments. Under the assumption of some prior knowledge about the background yield, a reference prior is obtained for the signal alone and its properties are studied. Finally, the properties of the full solution, the marginal reference posterior, are illustrated with a few examples.
Submitted 14 December, 2011; v1 submitted 22 August, 2011;
originally announced August 2011.
-
Statistical methods used in ATLAS for exclusion and discovery
Authors:
Diego Casadei
Abstract:
The statistical methods used by the ATLAS Collaboration for setting upper limits or establishing a discovery are reviewed, as they are fundamental ingredients in the search for new phenomena. The analyses published so far adopted different approaches, choosing a frequentist or a Bayesian or a hybrid frequentist-Bayesian method to perform a search for new physics and set upper limits. In this note, after the introduction of the necessary basic concepts of statistical hypothesis testing, a few recommendations are made about the preferred approaches to be followed in future analyses.
Submitted 10 August, 2011;
originally announced August 2011.
-
Estimating the selection efficiency
Authors:
Diego Casadei
Abstract:
The measurement of the efficiency of an event selection is always an important part of the analysis of experimental data. The statistical techniques which are needed to determine the efficiency and its uncertainty are reviewed. Frequentist and Bayesian approaches are illustrated, and the problem of choosing a meaningful prior is explicitly addressed. Several practical use cases are considered, from the problem of combining different samples to complex situations in which non-unit weights or non-independent selections have been used. The Bayesian approach makes it possible to find analytical expressions that solve even the most complicated problems; these make use of the family of Beta distributions, the conjugate priors for binomial sampling.
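The conjugacy mentioned at the end of the abstract can be illustrated with the simplest case, sketched below under stated assumptions: a single unweighted sample in which k of n events pass the selection, and a Beta(a, b) prior on the efficiency. The posterior is then Beta(a + k, b + n - k). The class and function names are hypothetical, and the uniform prior used as the default (a = b = 1) is only a placeholder, not a recommendation taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) density summarising the efficiency (hypothetical type)."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1.0))

    @property
    def mode(self) -> Optional[float]:
        # An interior mode exists only when both parameters exceed 1.
        if self.alpha > 1.0 and self.beta > 1.0:
            return (self.alpha - 1.0) / (self.alpha + self.beta - 2.0)
        return None


def efficiency_posterior(k: int, n: int,
                         prior_alpha: float = 1.0,
                         prior_beta: float = 1.0) -> BetaPosterior:
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior.

    k is the number of selected events and n the number of trials; the prior
    parameters are an assumption of this sketch (uniform prior by default).
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return BetaPosterior(prior_alpha + k, prior_beta + (n - k))


if __name__ == "__main__":
    post = efficiency_posterior(k=8, n=10)
    print(post.mean, post.mode, post.variance ** 0.5)
```

Passing `prior_alpha=0.5, prior_beta=0.5` instead would correspond to the Jeffreys prior for the binomial model; the update rule itself is unchanged.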
Submitted 25 July, 2012; v1 submitted 2 August, 2009;
originally announced August 2009.