-
Statistical witchhunts: Science, justice & the p-value crisis
Authors:
Spencer Wheatley,
Didier Sornette
Abstract:
We provide accessible insight into the current 'replication crisis' in 'statistical science', by revisiting the old metaphor of 'court trial as hypothesis test'. Inter alia, we define and diagnose harmful statistical witch-hunting both in justice and science, which extends to the replication crisis itself, where a hunt on p-values is currently underway.
Submitted 1 July, 2019; v1 submitted 11 April, 2019;
originally announced April 2019.
-
Data breaches in the catastrophe framework & beyond
Authors:
Spencer Wheatley,
Annette Hofmann,
Didier Sornette
Abstract:
Development of sustainable insurance for cyber risks, with associated benefits, inter alia requires reduction of ambiguity of the risk. Considering cyber risk, and data breaches in particular, as a man-made catastrophe clarifies the actuarial need for multiple levels of analysis - going beyond claims-driven loss statistics alone to include exposure, hazard, breach size, and so on - and necessitating specific advances in scope, quality, and standards of both data and models. The prominent human element, as well as dynamic, networked, and multi-type nature, of cyber risk makes it perhaps uniquely challenging. Complementary top-down statistical and bottom-up analytical approaches are discussed. Focusing on data breach severity, measured in private information items ('ids') extracted, we exploit relatively mature open data for U.S. data breaches. We show that this extremely heavy-tailed risk is worsening for external attacker ('hack') events - both in frequency and severity. Writing in Q2-2018, the median predicted number of ids breached in the U.S. due to hacking, for the last 6 months of 2018, is 0.5 billion, but with a 5% chance that the figure exceeds 7 billion - doubling the historical total. 'Fortunately' the total breach in that period turned out to be near the median.
Submitted 16 May, 2019; v1 submitted 3 January, 2019;
originally announced January 2019.
-
The Extreme Risk of Personal Data Breaches & The Erosion of Privacy
Authors:
Spencer Wheatley,
Thomas Maillart,
Didier Sornette
Abstract:
Personal data breaches from organisations, enabling mass identity fraud, constitute an \emph{extreme risk}. This risk worsens daily as an ever-growing amount of personal data is stored by organisations and online, and the attack surface surrounding this data becomes larger and harder to secure. Further, breached information is distributed and accumulates in the hands of cyber criminals, thus driving a cumulative erosion of privacy. Statistical modeling of breach data from 2000 through 2015 provides insights into this risk: A current maximum breach size of about 200 million is detected, and is expected to grow by fifty percent over the next five years. The breach sizes are found to be well modeled by an \emph{extremely heavy tailed} truncated Pareto distribution, with tail exponent parameter decreasing linearly from 0.57 in 2007 to 0.37 in 2015. With this current model, given a breach contains above fifty thousand items, there is a ten percent probability of exceeding ten million. A size effect is unearthed where both the frequency and severity of breaches scale with organisation size like $s^{0.6}$. Projections indicate that the total amount of breached information is expected to double from two to four billion items within the next five years, eclipsing the population of users of the Internet. This massive and uncontrolled dissemination of personal identities raises fundamental concerns about privacy.
Submitted 25 February, 2016; v1 submitted 28 May, 2015;
originally announced May 2015.
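The conditional exceedance probability quoted in the abstract above can be roughly checked with a short calculation. The sketch below assumes the common upper-truncated Pareto form, with the lower threshold at fifty thousand items, the truncation point at the reported maximum of about 200 million, and the 2015 tail exponent of 0.37. The exact parametrisation used in the paper is not stated in the abstract, so the function name and formula here are illustrative assumptions.

```python
def truncated_pareto_exceedance(x, lower, upper, alpha):
    """P(X > x | X > lower) for a Pareto law with tail exponent `alpha`,
    truncated from above at `upper` (assumed parametrisation)."""
    if x <= lower:
        return 1.0
    if x >= upper:
        return 0.0
    tail_at_x = (lower / x) ** alpha          # untruncated Pareto survival at x
    tail_at_upper = (lower / upper) ** alpha  # mass removed by the truncation
    return (tail_at_x - tail_at_upper) / (1.0 - tail_at_upper)


# Values taken from the abstract; treating 200 million as the truncation
# point is an assumption of this sketch.
p = truncated_pareto_exceedance(x=1e7, lower=5e4, upper=2e8, alpha=0.37)
print(f"P(breach > 10 million | breach > 50 thousand) = {p:.3f}")
```

Under these assumptions the result comes out close to the ten percent figure stated in the abstract.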
-
Estimation of the Hawkes Process With Renewal Immigration Using the EM Algorithm
Authors:
Spencer Wheatley,
Vladimir Filimonov,
Didier Sornette
Abstract:
We introduce the Hawkes process with renewal immigration and make its statistical estimation possible with two Expectation Maximization (EM) algorithms. The standard Hawkes process introduces immigrant points via a Poisson process, and each immigrant has a subsequent cluster of associated offspring of multiple generations. We generalize the immigration to come from a Renewal process; introducing dependence between neighbouring clusters, and allowing for over/under dispersion in cluster locations. This complicates evaluation of the likelihood since one needs to know which subset of the observed points are immigrants. Two EM algorithms enable estimation here: The first is an extension of an existing algorithm that treats the entire branching structure - which points are immigrants, and which point is the parent of each offspring - as missing data. The second considers only if a point is an immigrant or not as missing data and can be implemented with linear time complexity. Both algorithms are found to be consistent in simulation studies. Further, we show that misspecifying the immigration process introduces significant bias into model estimation, especially the branching ratio, which quantifies the strength of self excitation. Thus, this extended model provides a valuable alternative model in practice.
Submitted 26 July, 2014;
originally announced July 2014.
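To make the cluster structure described in the abstract above concrete, here is a minimal simulation sketch of a Hawkes process whose immigrants come from a renewal process. It is not the EM estimation procedure from the paper. The gamma inter-arrival times, the exponential offspring delay, and all function and parameter names are assumptions chosen for illustration only.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson count with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_renewal_hawkes(horizon, branching_ratio, shape, scale, decay, seed=0):
    """Simulate on (0, horizon): immigrants from a gamma renewal process,
    each event spawning Poisson(branching_ratio) offspring at exponential delays."""
    if not 0.0 <= branching_ratio < 1.0:
        raise ValueError("branching_ratio must lie in [0, 1)")
    rng = random.Random(seed)

    # Immigrants: renewal process with gamma(shape, scale) waiting times.
    immigrants = []
    t = rng.gammavariate(shape, scale)
    while t < horizon:
        immigrants.append(t)
        t += rng.gammavariate(shape, scale)

    # Offspring of all generations, generated cluster by cluster.
    events = list(immigrants)
    pending = list(immigrants)
    while pending:
        parent = pending.pop()
        for _ in range(sample_poisson(rng, branching_ratio)):
            child = parent + rng.expovariate(decay)
            if child < horizon:
                events.append(child)
                pending.append(child)
    return sorted(immigrants), sorted(events)


immigrants, events = simulate_renewal_hawkes(
    horizon=1000.0, branching_ratio=0.5, shape=2.0, scale=1.0, decay=1.0, seed=1
)
print(len(immigrants), "immigrants,", len(events), "events in total")
```

With shape 1 the gamma waiting times become exponential, so the sketch reduces to the standard Hawkes process with Poisson immigration. An estimator sees only the merged `events`, not the `immigrants` list, which is the missing-data problem the abstract describes.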