-
SampleHST: Efficient On-the-Fly Selection of Distributed Traces
Authors:
Alim Ul Gias,
Yicheng Gao,
Matthew Sheldon,
José A. Perusquía,
Owen O'Brien,
Giuliano Casale
Abstract:
Since only a small number of the traces generated by distributed tracing help in troubleshooting, the storage requirement can be significantly reduced by biasing the selection towards anomalous traces. To aid in this scenario, we propose SampleHST, a novel approach to sample on-the-fly from a stream of traces in an unsupervised manner. SampleHST adjusts the storage quota of normal and anomalous traces depending on the size of its budget. Initially, it utilizes a forest of Half Space Trees (HSTs) for trace scoring. This is based on the distribution of the mass scores across the trees, which characterizes the probability of observing different traces. The mass distribution from HSTs is subsequently used to cluster the traces online, leveraging a variant of the mean-shift algorithm. This trace-cluster association eventually drives the sampling decision. We have compared the performance of SampleHST with a recently suggested method using data from a cloud data center and demonstrated that SampleHST improves sampling performance by up to 9.5x.
Submitted 9 September, 2022;
originally announced October 2022.
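The mass-scoring step mentioned in the SampleHST abstract can be illustrated with a minimal half-space-tree sketch. This is not the authors' implementation: the class and function names (`Node`, `build_forest`, `update`, `mass_profile`) are hypothetical, features are assumed to be pre-scaled to [0, 1], and the budget-aware quota and mean-shift clustering stages are omitted. It only shows how a forest of random midpoint splits yields one mass value per tree for a given point.

```python
import random


class Node:
    """One node of a half-space tree over an axis-aligned box [lo, hi]."""

    def __init__(self, lo, hi, depth, max_depth, rng):
        self.mass = 0  # number of training points that reached this leaf
        self.left = self.right = None
        if depth < max_depth:
            # Pick a random feature and split its current range at the midpoint.
            self.dim = rng.randrange(len(lo))
            self.split = (lo[self.dim] + hi[self.dim]) / 2.0
            left_hi = list(hi)
            left_hi[self.dim] = self.split
            right_lo = list(lo)
            right_lo[self.dim] = self.split
            self.left = Node(lo, left_hi, depth + 1, max_depth, rng)
            self.right = Node(right_lo, hi, depth + 1, max_depth, rng)

    def leaf_for(self, x):
        """Walk from this node down to the leaf whose region contains x."""
        node = self
        while node.left is not None:
            node = node.left if x[node.dim] < node.split else node.right
        return node


def build_forest(n_trees, n_features, max_depth, seed=0):
    """Build independent trees over the unit box (assumed feature scaling)."""
    rng = random.Random(seed)
    lo, hi = [0.0] * n_features, [1.0] * n_features
    return [Node(lo, hi, 0, max_depth, rng) for _ in range(n_trees)]


def update(forest, x):
    """Record one observed point by incrementing its leaf mass in every tree."""
    for tree in forest:
        tree.leaf_for(x).mass += 1


def mass_profile(forest, x):
    """Return the per-tree leaf mass for x; low values suggest a rare point."""
    return [tree.leaf_for(x).mass for tree in forest]
```

In this simplified version, a point that falls in a region no training point has visited gets a mass of zero in every tree, while a point from a dense region gets large masses. A sampler built on top would consume the whole profile rather than a single averaged score.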
-
Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis
Authors:
José A. Perusquía,
Jim E. Griffin,
Cristiano Villa
Abstract:
$n$-gram profiles have been successfully and widely used to analyse long sequences of potentially differing lengths for clustering or classification. Machine learning algorithms have mainly been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. A novel class of Bayesian generative models for $n$-gram profiles used as binary attributes has been designed to address this. The flexibility of the proposed modelling allows a straightforward approach to feature selection in the generative model. Furthermore, a slice sampling algorithm is derived for a fast inferential procedure, which is applied to synthetic and real data scenarios and shows that feature selection can improve classification accuracy.
Submitted 1 September, 2024; v1 submitted 23 November, 2020;
originally announced November 2020.
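To make the phrase "$n$-gram profiles used as binary attributes" in the Beta-CoRM abstract concrete, the sketch below turns sequences of differing lengths into fixed-length 0/1 vectors, one entry per distinct $n$-gram. The function names `ngrams` and `binary_profiles` are hypothetical, and this covers only the data representation, not the Bayesian model or the slice sampler.

```python
def ngrams(sequence, n):
    """Return the set of contiguous n-grams (as tuples) found in a sequence."""
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def binary_profiles(sequences, n):
    """Encode each sequence as a 0/1 vector over the shared n-gram vocabulary."""
    per_sequence = [ngrams(seq, n) for seq in sequences]
    # Sorted union of all observed n-grams fixes the column order.
    vocabulary = sorted(set().union(*per_sequence))
    matrix = [
        [1 if gram in grams else 0 for gram in vocabulary]
        for grams in per_sequence
    ]
    return vocabulary, matrix


vocabulary, matrix = binary_profiles(["abab", "abc"], n=2)
# vocabulary → [('a', 'b'), ('b', 'a'), ('b', 'c')]
# matrix     → [[1, 1, 0], [1, 0, 1]]
```

Each row records only the presence or absence of an $n$-gram, so sequences of any length map to vectors of the same dimension.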
-
Bayesian Models Applied to Cyber Security Anomaly Detection Problems
Authors:
José A. Perusquía,
Jim E. Griffin,
Cristiano Villa
Abstract:
Cyber security is an important concern for all individuals, organisations and governments globally. Cyber attacks have become more sophisticated, frequent and dangerous than ever, and traditional anomaly detection methods have proved less effective when dealing with these new classes of cyber threats. To address this, both classical and Bayesian models offer a valid and innovative alternative to the traditional signature-based methods, motivating the increasing interest in statistical research that has been observed in recent years. In this review we describe some typical cyber security challenges, typical types of data and statistical methods, paying special attention to Bayesian approaches for these problems.
Submitted 3 June, 2021; v1 submitted 23 March, 2020;
originally announced March 2020.