-
Approximate learning of parsimonious Bayesian context trees
Authors:
Daniyar Ghani,
Nicholas A. Heard,
Francesco Sanna Passino
Abstract:
Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, these models fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to parsimoniously capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. Approximate inference on the context tree structure is performed via a computationally efficient model-based agglomerative clustering procedure. The proposed framework is tested on synthetic and real-world data examples, and it outperforms existing sequence models when fitted to real protein sequences and honeypot computer terminal sessions.
Submitted 27 July, 2024;
originally announced July 2024.
-
Nested Dirichlet models for unsupervised attack pattern detection in honeypot data
Authors:
Francesco Sanna Passino,
Anastasia Mantziou,
Daniyar Ghani,
Philip Thiede,
Ross Bevington,
Nicholas A. Heard
Abstract:
Cyber-systems are under near-constant threat from intrusion attempts. Attack types vary, but each attempt typically has a specific underlying intent, and the perpetrators are typically groups of individuals with similar objectives. Clustering attacks appearing to share a common intent is very valuable to threat-hunting experts. This article explores Dirichlet distribution topic models for clustering terminal session commands collected from honeypots, which are special network hosts designed to entice malicious attackers. The main practical implications of clustering the sessions are two-fold: finding similar groups of attacks, and identifying outliers. A range of statistical models are considered, adapted to the structures of command-line syntax. In particular, concepts of primary and secondary topics, and then session-level and command-level topics, are introduced into the models to improve interpretability. The proposed methods are further extended in a Bayesian nonparametric fashion to allow unboundedness in the vocabulary size and the number of latent intents. The methods are shown to discover an unusual MIRAI variant which attempts to take over existing cryptocurrency coin-mining infrastructure, not detected by traditional topic-modelling approaches.
Submitted 21 December, 2024; v1 submitted 6 January, 2023;
originally announced January 2023.
-
Changepoint detection in non-exchangeable data
Authors:
Karl L. Hallgren,
Nicholas A. Heard,
Niall M. Adams
Abstract:
Changepoint models typically assume the data within each segment are independent and identically distributed conditional on some parameters which change across segments. This construction may be inadequate when data are subject to local correlation patterns, often resulting in many more changepoints fitted than preferable. This article proposes a Bayesian changepoint model which relaxes the assumption of exchangeability within segments. The proposed model supposes data within a segment are $m$-dependent for some unknown $m \geqslant 0$ which may vary between segments, resulting in a model suitable for detecting clear discontinuities in data which are subject to different local temporal correlations. The approach is suited to both continuous and discrete data. A novel reversible jump MCMC algorithm is proposed to sample from the model; in particular, a detailed analysis of the parameter space is exploited to build proposals for the orders of dependence. Two applications demonstrate the benefits of the proposed model: computer network monitoring via change detection in count data, and segmentation of financial time series.
Submitted 9 November, 2021;
originally announced November 2021.
-
Latent structure blockmodels for Bayesian spectral graph clustering
Authors:
Francesco Sanna Passino,
Nicholas A. Heard
Abstract:
Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure blockmodels (LSBMs) is proposed to address such scenarios, allowing for graph clustering when community-specific one-dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to perform well on simulated and real-world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
Submitted 2 January, 2022; v1 submitted 4 July, 2021;
originally announced July 2021.
-
Mutually exciting point process graphs for modelling dynamic networks
Authors:
Francesco Sanna Passino,
Nicholas A. Heard
Abstract:
A new class of models for dynamic networks is proposed, called mutually exciting point process graphs (MEG). MEG is a scalable network-wide statistical model for point processes with dyadic marks, which can be used for anomaly detection when assessing the significance of future events, including previously unobserved connections between nodes. The model combines mutually exciting point processes to estimate dependencies between events and latent space models to infer relationships between the nodes. The intensity functions for each network edge are characterised exclusively by node-specific parameters, which allows information to be shared across the network. This construction enables estimation of intensities even for unobserved edges, which is particularly important in real world applications, such as computer networks arising in cyber-security. A recursive form of the log-likelihood function for MEG is obtained, which is used to derive fast inferential procedures via modern gradient ascent algorithms. An alternative EM algorithm is also derived. The model and algorithms are tested on simulated graphs and real world datasets, demonstrating excellent performance.
Submitted 22 December, 2021; v1 submitted 11 February, 2021;
originally announced February 2021.
-
Changepoint detection on a graph of time series
Authors:
Karl L. Hallgren,
Nicholas A. Heard,
Melissa J. M. Turcotte
Abstract:
When analysing multiple time series that may be subject to changepoints, it is sometimes possible to specify a priori, by means of a graph, which pairs of time series are likely to be impacted by simultaneous changepoints. This article proposes an informative prior for changepoints which encodes the information contained in the graph, inducing a changepoint model for multiple time series that borrows strength across clusters of connected time series to detect weak signals for synchronous changepoints. The graphical model for changepoints is further extended to allow dependence between nearby but not necessarily synchronous changepoints across neighbouring time series in the graph. A novel reversible jump Markov chain Monte Carlo (MCMC) algorithm making use of auxiliary variables is proposed to sample from the graphical changepoint model. The merit of the proposed approach is demonstrated through a changepoint analysis of computer network authentication logs from Los Alamos National Laboratory (LANL), demonstrating an improvement at detecting weak signals for network intrusions across users linked by network connectivity, whilst limiting the number of false alerts.
Submitted 8 February, 2023; v1 submitted 8 February, 2021;
originally announced February 2021.
-
Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel
Authors:
Francesco Sanna Passino,
Nicholas A. Heard,
Patrick Rubin-Delanchy
Abstract:
Spectral clustering is a popular method for community detection in network graphs: starting from a matrix representation of the graph, the nodes are clustered on a low dimensional projection obtained from a truncated spectral decomposition of the matrix. Estimating correctly the number of communities and the dimension of the reduced latent space is critical for good performance of spectral clustering algorithms. Furthermore, many real-world graphs, such as enterprise computer networks studied in cyber-security applications, often display heterogeneous within-community degree distributions. Such heterogeneous degree distributions are usually not well captured by standard spectral clustering algorithms. In this article, a novel spectral clustering algorithm is proposed for community detection under the degree-corrected stochastic blockmodel. The proposed method is based on a transformation of the spectral embedding to spherical coordinates, and a novel modelling assumption in the transformed space. The method allows for simultaneous and automated selection of the number of communities and the latent dimension for spectral embeddings of graphs with uneven node degrees. Results show improved performance over competing methods in representing computer networks.
Submitted 8 September, 2021; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Graph link prediction in computer networks using Poisson matrix factorisation
Authors:
Francesco Sanna Passino,
Melissa J. M. Turcotte,
Nicholas A. Heard
Abstract:
Graph link prediction is an important task in cyber-security: relationships between entities within a computer network, such as users interacting with computers, or system libraries and the corresponding processes that use them, can provide key insights into adversary behaviour. Poisson matrix factorisation (PMF) is a popular model for link prediction in large networks, particularly useful for its scalability. In this article, PMF is extended to include scenarios that are commonly encountered in cyber-security applications. Specifically, an extension is proposed to explicitly handle binary adjacency matrices and include known categorical covariates associated with the graph nodes. A seasonal PMF model is also presented to handle seasonal networks. To allow the methods to scale to large graphs, variational methods are discussed for performing fast inference. The results show an improved performance over the standard PMF model and other statistical network models.
Submitted 21 May, 2021; v1 submitted 26 January, 2020;
originally announced January 2020.
-
Link prediction in dynamic networks using random dot product graphs
Authors:
Francesco Sanna Passino,
Anna S. Bertiger,
Joshua C. Neil,
Nicholas A. Heard
Abstract:
The problem of predicting links in large networks is an important task in a variety of practical applications, including social sciences, biology and computer security. In this paper, statistical techniques for link prediction based on the popular random dot product graph model are carefully presented, analysed and extended to dynamic settings. Motivated by a practical application in cyber-security, this paper demonstrates that random dot product graphs not only represent a powerful tool for inferring differences between multiple networks, but are also efficient for prediction purposes and for understanding the temporal evolution of the network. The probabilities of links are obtained by fusing information at two stages: spectral methods provide estimates of latent positions for each node, and time series models are used to capture temporal dynamics. In this way, traditional link prediction methods, usually based on decompositions of the entire network adjacency matrix, are extended using temporal information. The methods presented in this article are applied to a number of simulated and real-world graphs, showing promising results.
Submitted 13 July, 2021; v1 submitted 22 December, 2019;
originally announced December 2019.
-
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
Authors:
Francesco Sanna Passino,
Nicholas A. Heard
Abstract:
Spectral embedding of adjacency or Laplacian matrices of undirected graphs is a common technique for representing a network in a lower dimensional latent space, with optimal theoretical guarantees. The embedding can be used to estimate the community structure of the network, with strong consistency results in the stochastic blockmodel framework. One of the main practical limitations of standard algorithms for community detection from spectral embeddings is that the number of communities and the latent dimension of the embedding must be specified in advance. In this article, a novel Bayesian model for simultaneous and automatic selection of the appropriate dimension of the latent space and the number of blocks is proposed. Extensions to directed and bipartite graphs are discussed. The model is tested on simulated and real world network data, showing promising performance for recovering latent community structure.
Submitted 28 May, 2020; v1 submitted 6 April, 2019;
originally announced April 2019.
-
Adaptive sequential Monte Carlo for multiple changepoint analysis
Authors:
Melissa J. M. Turcotte,
Nicholas A. Heard
Abstract:
Process monitoring and control requires detection of structural changes in a data stream in real time. This article introduces an efficient sequential Monte Carlo algorithm designed for learning unknown changepoints in continuous time. The method is intuitively simple: new changepoints for the latest window of data are proposed by conditioning only on data observed since the most recent estimated changepoint, as these carry most of the information about the state of the process prior to the update. The proposed method shows improved performance over the current state of the art. Another advantage of the proposed algorithm is that it can be made adaptive, varying the number of particles according to the apparent local complexity of the target changepoint probability distribution. This saves valuable computing time when changes in the changepoint distribution are negligible, and enables re-balancing of the importance weights of existing particles when a significant change in the target distribution is encountered. The plain and adaptive versions of the method are illustrated using the canonical continuous time changepoint problem of inferring the intensity of an inhomogeneous Poisson process. Performance is demonstrated using both conjugate and non-conjugate Bayesian models for the intensity.
Submitted 28 September, 2015;
originally announced September 2015.
-
A test for dependence between two point processes on the real line
Authors:
Patrick Rubin-Delanchy,
Nicholas A. Heard
Abstract:
Many scientific questions rely on determining whether two sequences of event times are associated. This article introduces a likelihood ratio test which can be parameterised in several ways to detect different forms of dependence. A common finite-sample distribution is derived, and shown to be asymptotically related to a weighted Kolmogorov-Smirnov test. Analysis leading to these results also motivates a more general tool for diagnosing dependence. The methodology is demonstrated on data generated on an email network, showing evidence of information flow using only timing information. Implementation code is available in the R package 'mppa'.
Submitted 20 December, 2014; v1 submitted 17 August, 2014;
originally announced August 2014.
-
Bayesian anomaly detection methods for social networks
Authors:
Nicholas A. Heard,
David J. Weston,
Kiriaki Platanioti,
David J. Hand
Abstract:
Learning the network structure of a large graph is computationally demanding, and dynamically monitoring the network over time for any changes in structure threatens to be more challenging still. This paper presents a two-stage method for anomaly detection in dynamic graphs: the first stage uses simple, conjugate Bayesian models for discrete time counting processes to track the pairwise links of all nodes in the graph to assess normality of behavior; the second stage applies standard network inference tools on a greatly reduced subset of potentially anomalous nodes. The utility of the method is demonstrated on simulated and real data sets.
Submitted 8 November, 2010;
originally announced November 2010.