-
Hierarchical Representations for Evolving Acyclic Vector Autoregressions (HEAVe)
Authors:
Cameron Cornell,
Lewis Mitchell,
Matthew Roughan
Abstract:
Causal networks offer an intuitive framework to understand influence structures within time series systems. However, the presence of cycles can obscure dynamic relationships and hinder hierarchical analysis. These networks are typically identified through multivariate predictive modelling, but enforcing acyclic constraints significantly increases computational and analytical complexity. Despite recent advances, there remains a lack of simple, flexible approaches that are easily tailorable to specific problem instances. We propose an evolutionary approach to fitting acyclic vector autoregressive processes and introduce a novel hierarchical representation that directly models structural elements within a time series system. On simulated datasets, our model retains most of the predictive accuracy of unconstrained models and outperforms permutation-based alternatives. When applied to a dataset of 100 cryptocurrency return series, our method generates acyclic causal networks capturing key structural properties of the unconstrained model. The acyclic networks are approximately sub-graphs of the unconstrained networks, and most of the removed links originate from low-influence nodes. Given the high levels of feature preservation, we conclude that this cryptocurrency price system functions largely hierarchically. Our findings demonstrate a flexible, intuitive approach for identifying hierarchical causal networks in time series systems, with broad applications to fields like econometrics and social network analysis.
Submitted 19 May, 2025;
originally announced May 2025.
-
Evolutionary Generation of Random Surreal Numbers for Benchmarking
Authors:
Matthew Roughan
Abstract:
There are many areas of scientific endeavour where large, complex datasets are needed for benchmarking. Evolutionary computing provides a means towards creating such sets. As a case study, we consider Conway's Surreal numbers. They have largely been treated as a theoretical construct, with little effort towards empirical study, at least in part because of the difficulty of working with all but the smallest numbers. To advance this status, we need efficient algorithms, and in order to develop such we need benchmark data sets of surreal numbers. In this paper, we present a method for generating ensembles of random surreal numbers to benchmark algorithms. The approach uses an evolutionary algorithm to create the benchmark datasets where we can analyse and control features of the resulting test sets. Ultimately, the process is designed to generate networks with defined properties, and we expect this to be useful for other types of network data.
Submitted 9 April, 2025;
originally announced April 2025.
-
Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism
Authors:
Wathsala Karunarathne,
Indu Bala,
Dikshit Chauhan,
Matthew Roughan,
Lewis Mitchell
Abstract:
This study modifies the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm for multi-modal optimization problems. The enhancements focus on addressing the challenges of multiple global minima, improving the algorithm's ability to maintain diversity and explore complex fitness landscapes. We incorporate niching strategies and dynamic adaptation mechanisms to refine the algorithm's performance in identifying and optimizing multiple global optima. The algorithm generates a population of candidate solutions by sampling from a multivariate normal distribution centered around the current mean vector, with the spread determined by the step size and covariance matrix. Each solution's fitness is evaluated as a weighted sum of its contributions to all global minima, maintaining population diversity and preventing premature convergence. We implemented the algorithm on 8 tunable composite functions for the GECCO 2024 Competition on Benchmarking Niching Methods for Multi-Modal Optimization (MMO), adhering to the competition's benchmarking framework. The results are presented using several measures, such as Peak Ratio and F1 score, across various dimensions. They demonstrate the algorithm's robustness and effectiveness in handling both global optimization and MMO-specific challenges, providing a comprehensive solution for complex multi-modal optimization problems.
Submitted 30 June, 2024;
originally announced July 2024.
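The sampling step described in the abstract above (candidates drawn from a multivariate normal centred on the current mean, with spread set by the step size and covariance matrix) can be sketched as follows. This is a minimal illustration of that one step only, not the authors' modified algorithm: the function names `cholesky` and `sample_population`, and the example mean, step size and covariance, are assumptions made for illustration, and no niching or adaptation is included.

```python
import math
import random


def cholesky(C):
    """Lower-triangular factor A with A * A^T = C (C symmetric positive definite)."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(C[i][i] - s)
            else:
                A[i][j] = (C[i][j] - s) / A[j][j]
    return A


def sample_population(mean, sigma, C, size, rng):
    """Draw `size` candidates x = mean + sigma * A z, with z ~ N(0, I)."""
    A = cholesky(C)
    n = len(mean)
    population = []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y = A z has covariance C because A A^T = C.
        y = [sum(A[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        population.append([mean[i] + sigma * y[i] for i in range(n)])
    return population


# Illustrative values only (not taken from the paper).
rng = random.Random(42)
candidates = sample_population(
    mean=[1.0, -2.0], sigma=0.5, C=[[4.0, 2.0], [2.0, 3.0]], size=10, rng=rng
)
```

Factorising the covariance once per generation keeps each draw cheap; a full CMA-ES implementation would then rank the candidates and update the mean, step size and covariance from the best of them.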
-
The entropy rate of Linear Additive Markov Processes
Authors:
Bridget Smart,
Matthew Roughan,
Lewis Mitchell
Abstract:
This work derives a theoretical value for the entropy of a Linear Additive Markov Process (LAMP), an expressive model able to generate sequences with a given autocorrelation structure. While a first-order Markov Chain model generates new values by conditioning on the current state, the LAMP model takes the transition state from the sequence's history according to some distribution which does not have to be bounded. The LAMP model captures complex relationships and long-range dependencies in data with similar expressibility to a higher-order Markov process. While a higher-order Markov process has a polynomial parameter space, a LAMP model is characterised only by a probability distribution and the transition matrix of an underlying first-order Markov Chain. We prove that the theoretical entropy rate of a LAMP is equivalent to the theoretical entropy rate of the underlying first-order Markov Chain. This surprising result is explained by the randomness introduced by the random process which selects the LAMP transitioning state, and provides a tool to model complex dependencies in data while retaining useful theoretical results. We use the LAMP model to estimate the entropy rate of the LastFM, BrightKite, Wikispeedia and Reuters-21578 datasets. We compare estimates calculated using frequency probability estimates, a first-order Markov model and the LAMP model, and consider two approaches to ensuring the transition matrix is irreducible. In most cases the LAMP entropy rates are lower than those of the alternatives, suggesting that the LAMP model is better at accommodating structural dependencies in the processes.
Submitted 9 January, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
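Since the abstract above states that the LAMP entropy rate equals that of its underlying first-order Markov chain, the quantity of interest reduces to the standard formula $H = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}$, where $\pi$ is the stationary distribution of the transition matrix $P$. The sketch below evaluates that formula; the function names and the two-state example matrix are illustrative assumptions, not taken from the paper, and power iteration assumes the chain is irreducible and aperiodic.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by power iteration.

    Assumes P is a row-stochastic matrix of an irreducible, aperiodic chain.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(P):
    """Entropy rate in bits per step: -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    n = len(P)
    return -sum(
        pi[i] * P[i][j] * math.log2(P[i][j])
        for i in range(n)
        for j in range(n)
        if P[i][j] > 0  # 0 * log 0 is taken as 0
    )


# Hypothetical two-state chain, chosen only for illustration.
P_example = [[0.9, 0.1],
             [0.5, 0.5]]
rate = entropy_rate(P_example)
```

For this example the stationary distribution is (5/6, 1/6), so the rate is a weighted average of the two row entropies, dominated by the low-entropy first row.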
-
Performance Analysis: Discovering Semi-Markov Models From Event Logs
Authors:
Anna Kalenkova,
Lewis Mitchell,
Matthew Roughan
Abstract:
Process mining is a well-established discipline of data analysis focused on the discovery of process models from information systems' event logs. Recently, an emerging subarea of process mining, known as stochastic process discovery, has started to evolve. Stochastic process discovery considers frequencies of events in the event data and allows for a more comprehensive analysis. In particular, when the durations of activities are present in the event log, performance characteristics of the discovered stochastic models can be analyzed, e.g., the overall process execution time can be estimated. Existing performance analysis techniques usually discover stochastic process models from event data, and then simulate these models to evaluate their execution times. These methods rely on empirical approaches. This paper proposes analytical techniques for performance analysis that allow for the derivation of statistical characteristics of the overall processes' execution times in the presence of arbitrary time distributions of events modeled by semi-Markov processes. The proposed methods include express analysis, focused on the mean execution time estimation, and full analysis techniques that build probability density functions (PDFs) of process execution times in both continuous and discrete forms. These methods are implemented and tested on real-world event data, demonstrating their potential for what-if analysis by providing solutions without resorting to simulation. Specifically, we demonstrated that the discrete approach is more time-efficient for small duration support sizes compared to the simulation technique. Furthermore, we showed that the continuous approach, with PDFs represented as Gaussian Mixture Models (GMMs), facilitates the discovery of more compact and interpretable models.
Submitted 6 March, 2025; v1 submitted 29 June, 2022;
originally announced June 2022.
-
Information flow estimation: a study of news on Twitter
Authors:
Tobin South,
Bridget Smart,
Matthew Roughan,
Lewis Mitchell
Abstract:
News media has long been an ecosystem of creation, reproduction, and critique, where news outlets report on current events and add commentary to ongoing stories. Understanding the dynamics of news information creation and dispersion is important to accurately ascribe credit to influential work and understand how societal narratives develop. These dynamics can be modelled through a combination of information-theoretic natural language processing and networks; and can be parameterised using large quantities of textual data. However, it is challenging to see "the wood for the trees", i.e., to detect small but important flows of information in a sea of noise. Here we develop new comparative techniques to estimate temporal information flow between pairs of text producers. Using both simulated and real text data we compare the reliability and sensitivity of methods for estimating textual information flow, showing that a metric that normalises by local neighbourhood structure provides a robust estimate of information flow in large networks. We apply this metric to a large corpus of news organisations on Twitter and demonstrate its usefulness in identifying influence within an information ecosystem, finding that average information contribution to the network is not correlated with the number of followers or the number of tweets. This suggests that small local organisations and right-wing organisations which have lower average follower counts still contribute significant information to the ecosystem. Further, the methods are applied to smaller full-text datasets of specific news events across news sites and Russian troll accounts on Twitter. The information flow estimation reveals and quantifies features of how these events develop and the role of groups of trolls in setting disinformation narratives.
Submitted 28 September, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Boolean Expressions in Firewall Analysis
Authors:
Adam Hamilton,
Matthew Roughan,
Giang T. Nguyen
Abstract:
Firewall policies are an important line of defence in cybersecurity, specifying which packets are allowed to pass through a network and which are not. These firewall policies are made up of a list of interacting rules. In practice, a firewall can consist of hundreds or thousands of rules. This can be very difficult for a human to correctly configure. One proposed solution is to model firewall policies as Boolean expressions and use existing computer programs such as SAT solvers to verify that the firewall satisfies certain conditions. This paper takes an in-depth look at the Boolean expressions that represent firewall policies. We present an algorithm that translates a list of firewall rules into a Boolean expression in conjunctive normal form (CNF) or disjunctive normal form (DNF). We also place an upper bound on the size of the CNF and DNF that is polynomial in the number of rules in the firewall policy. This shows that the combinatorial explosion suggested by past results, when converting from a Boolean expression in CNF to one in DNF, does not occur in the context of firewall analysis.
Submitted 3 May, 2022;
originally announced May 2022.
-
Em-K Indexing for Approximate Query Matching in Large-scale ER
Authors:
Samudra Herath,
Matthew Roughan,
Gary Glonek
Abstract:
Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects requiring integrating and processing massive data collections. It is becoming increasingly important in real-world applications to develop ER solutions that produce prompt responses for entity queries on large-scale databases. Some of these applications demand entity query matching against large-scale reference databases within a short time. We define this as the query matching problem in ER in this work. Indexing or blocking techniques reduce the search space and execution time in the ER process. However, approximate indexing techniques that scale to very large-scale datasets remain open to research. In this paper, we investigate the query matching problem in ER to propose an indexing method suitable for approximate and efficient query matching.
We first use spatial mappings to embed records in a multidimensional Euclidean space that preserves the domain-specific similarity. Among the various mapping techniques, we choose multidimensional scaling. Then, using a Kd-tree and nearest neighbour search, the method returns a block of records that includes potential matches for a query. Our method can process queries against a large-scale dataset using only a fraction of the data $L$ (given the dataset size is $N$), with an $O(L^2)$ complexity where $L \ll N$. The experiments conducted on several datasets showed the effectiveness of the proposed method.
Submitted 7 November, 2021;
originally announced November 2021.
-
High Performance Out-of-sample Embedding Techniques for Multidimensional Scaling
Authors:
Samudra Herath,
Matthew Roughan,
Gary Glonek
Abstract:
The recent rapid growth of the dimension of many datasets means that many approaches to dimension reduction (DR) have gained significant attention. High-performance DR algorithms are required to make data analysis feasible for big and fast data sets. However, many traditional DR techniques are challenged by truly large data sets. In particular multidimensional scaling (MDS) does not scale well. MDS is a popular group of DR techniques because it can perform DR on data where the only input is a dissimilarity function. However, common approaches are at least quadratic in memory and computation and, hence, prohibitive for large-scale data.
We propose an out-of-sample embedding (OSE) solution to extend the MDS algorithm for large-scale data utilising the embedding of only a subset of the given data. We present two OSE techniques: the first based on an optimisation approach and the second based on a neural network model. With a minor trade-off in the approximation, the out-of-sample techniques can process large-scale data with reasonable computation and memory requirements. While both methods perform well, the neural network model outperforms the optimisation approach of the OSE solution in terms of efficiency. OSE has the dual benefit that it allows fast DR on streaming datasets as well as static databases.
Submitted 7 November, 2021;
originally announced November 2021.
-
Convergence of Conditional Entropy for Long Range Dependent Markov Chains
Authors:
Andrew Feutrill,
Matthew Roughan
Abstract:
In this paper we consider the convergence of the conditional entropy to the entropy rate for Markov chains. Convergence of certain statistics of long range dependent processes, such as the sample mean, is slow. It has been shown by Carpio and Daley that the convergence of the $n$-step transition probabilities to the stationary distribution is slow, without quantifying the convergence rate. We prove that the slow convergence also applies to convergence to an information-theoretic measure, the entropy rate, by showing that the convergence rate is equivalent to the convergence rate of the $n$-step transition probabilities to the stationary distribution, which is equivalent to the Markov chain mixing time problem. Then we quantify this convergence rate, and show that it is $O(n^{2H-2})$, where $n$ is the number of steps of the Markov chain and $H$ is the Hurst parameter. Finally, we show that due to this slow convergence, the mutual information between past and future is infinite if and only if the Markov chain is long range dependent. This is a discrete analogue of characterisations which have been shown for other long range dependent processes.
Submitted 28 October, 2021;
originally announced October 2021.
-
NPD Entropy: A Non-Parametric Differential Entropy Rate Estimator
Authors:
Andrew Feutrill,
Matthew Roughan
Abstract:
The estimation of entropy rates for stationary discrete-valued stochastic processes is a well studied problem in information theory. However, estimating the entropy rate for stationary continuous-valued stochastic processes has not received as much attention. In fact, many current techniques are not able to accurately estimate or characterise the complexity of the differential entropy rate for strongly correlated processes, such as Fractional Gaussian Noise and ARFIMA(0,d,0), to the point that some cannot even detect the trend of the entropy rate as a function of the Hurst parameter, e.g. where it increases or decreases, where it attains its maximum, or its asymptotic behaviour. However, a recently developed technique provides accurate estimates at a high computational cost. In this paper, we define a robust technique for non-parametrically estimating the differential entropy rate of a continuous-valued stochastic process from observed data, by making an explicit link between the differential entropy rate and the Shannon entropy rate of a quantised version of the original data. Estimation is performed by a Shannon entropy rate estimator, and then converted to a differential entropy rate estimate. We show that this technique inherits many important statistical properties from the Shannon entropy rate estimator. The estimator is able to provide better estimates than the defined relative measures and much quicker estimates than known absolute measures, for strongly correlated processes. Finally, we analyse the complexity of the estimation technique and test the robustness to non-stationarity, and show that none of the current techniques are robust to non-stationarity, even if they are robust to strong correlations.
Submitted 24 May, 2021;
originally announced May 2021.
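The link between differential entropy and the Shannon entropy of quantised data mentioned in the abstract above can be illustrated with the textbook relation $h(X) \approx H(X_\Delta) + \log \Delta$ for a fine bin width $\Delta$. The sketch below is a simplified illustration under stated assumptions, not the NPD estimator itself: it uses i.i.d. Gaussian samples (for which the entropy rate equals the marginal entropy) and a plug-in histogram estimate in place of the Shannon entropy rate estimator, which the abstract does not specify. The function name and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def quantised_entropy_estimate(samples, delta):
    """Estimate differential entropy (nats) as H(quantised data) + log(delta)."""
    # Quantise each sample to the index of its width-delta bin.
    bins = Counter(math.floor(x / delta) for x in samples)
    n = len(samples)
    # Plug-in Shannon entropy of the empirical bin distribution.
    shannon = -sum((c / n) * math.log(c / n) for c in bins.values())
    # Convert the discrete entropy to a differential entropy estimate.
    return shannon + math.log(delta)


# Illustrative check on i.i.d. standard normal data (seed fixed for repeatability).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
estimate = quantised_entropy_estimate(samples, delta=0.1)
exact = 0.5 * math.log(2 * math.pi * math.e)  # differential entropy of N(0, 1)
```

For correlated processes the plug-in histogram entropy would have to be replaced by an estimator of the Shannon entropy rate of the quantised sequence; only the final `+ log(delta)` conversion carries over unchanged.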
-
Differential Entropy Rate Characterisations of Long Range Dependent Processes
Authors:
Andrew Feutrill,
Matthew Roughan
Abstract:
A quantity of interest to characterise continuous-valued stochastic processes is the differential entropy rate. The rate of convergence of many properties of LRD processes is slower than might be expected, based on the intuition for conventional processes, e.g. Markov processes. Is this also true of the entropy rate?
In this paper we consider the properties of the differential entropy rate of stochastic processes that have an autocorrelation function that decays as a power law. We show that power law decaying processes with similar autocorrelation and spectral density functions, Fractional Gaussian Noise and ARFIMA(0,d,0), have different entropic properties, particularly for negatively correlated parameterisations. Then we provide an equivalence between the mutual information between past and future and the differential excess entropy for stationary Gaussian processes, showing the finiteness of this quantity is the boundary between long and short range dependence. Finally, we analyse the convergence of the conditional entropy to the differential entropy rate and show that for short range dependence the rate of convergence is of the order $O(n^{-1})$, but it is slower for long range dependent processes and depends on the Hurst parameter.
Submitted 30 October, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
The Polylogarithm Function in Julia
Authors:
Matthew Roughan
Abstract:
The polylogarithm function is one of the constellation of important mathematical functions. It has a long history, and many connections to other special functions and series, and many applications, for instance in statistical physics. However, the practical aspects of its numerical evaluation have not received the type of comprehensive treatments lavished on its siblings. Only a handful of formal publications consider the evaluation of the function, and most focus on a specific domain and/or presume arbitrary precision arithmetic will be used. And very little of the literature contains any formal validation of numerical performance. In this paper we present an algorithm for calculating polylogarithms for both complex parameter and argument and evaluate it thoroughly in comparison to the arbitrary precision implementation in Mathematica. The implementation was created in a new scientific computing language Julia, which is ideal for the purpose, but also allows us to write the code in a simple, natural manner so as to make it easy to port the implementation to other such languages.
Submitted 16 October, 2020;
originally announced October 2020.
-
Simulating Name-like Vectors for Testing Large-scale Entity Resolution
Authors:
Samudra Herath,
Matthew Roughan,
Gary Glonek
Abstract:
Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades. In our work, we are interested in developing ER methods to handle big data. Good public datasets are restricted in this area and usually small in size. Simulation is one technique for generating datasets for testing. Existing simulation tools have problems of complexity, scalability and limitations of resampling. We address these problems by introducing a better way of simulating testing data for big data ER. Our proposed simulation model is simple, inexpensive and fast. We focus on avoiding the detail-level simulation of records using a simple vector representation. In this paper, we will discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).
Submitted 7 September, 2020;
originally announced September 2020.
-
Popularity and Centrality in Spotify Networks: Critical transitions in eigenvector centrality
Authors:
Tobin South,
Matthew Roughan,
Lewis Mitchell
Abstract:
The modern age of digital music access has increased the availability of data about music consumption and creation, facilitating the large-scale analysis of the complex networks that connect music together. Data about user streaming behaviour, and the musical collaboration networks are particularly important with new data-driven recommendation systems. Without thorough analysis, such collaboration graphs can lead to false or misleading conclusions. Here we present a new collaboration network of artists from the online music streaming service Spotify, and demonstrate a critical change in the eigenvector centrality of artists, as low popularity artists are removed. The critical change in centrality, from classical artists to rap artists, demonstrates deeper structural properties of the network. A Social Group Centrality model is presented to simulate this critical transition behaviour, and switching between dominant eigenvectors is observed. This model presents a novel investigation of the effect of popularity bias on how centrality and importance are measured, and provides a new tool for examining such flaws in networks.
Submitted 29 August, 2021; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Counting Candy Crush Configurations
Authors:
Adam Hamilton,
Giang T. Nguyen,
Matthew Roughan
Abstract:
A k-stable c-coloured Candy Crush grid is a weak proper c-colouring of a particular type of k-uniform hypergraph. In this paper we introduce a fully polynomial randomised approximation scheme (FPRAS) which counts the number of k-stable c-coloured Candy Crush grids of a given size (m, n) for certain values of c and k. We implemented this algorithm in Matlab, and found that in a Candy Crush grid with 7 available colours there are approximately 4.3 × 10^61 3-stable colourings. (Note that typical Candy Crush games are played with 6 colours, and our FPRAS is not guaranteed to work in expected polynomial time with k = 3 and c = 6.) We also discuss the applicability of this FPRAS to the problem of counting the number of weak c-colourings of other, more general hypergraphs.
Submitted 26 August, 2019;
originally announced August 2019.
-
Bayesian inference of network structure from information cascades
Authors:
Caitlin Gray,
Lewis Mitchell,
Matthew Roughan
Abstract:
Contagion processes are strongly linked to the network structures on which they propagate, and learning these structures is essential for understanding and intervention on complex network processes such as epidemics and (mis)information propagation. However, using contagion data to infer network structure is a challenging inverse problem. In particular, it is imperative to have appropriate measures of uncertainty in network structure estimates; however, these are largely ignored in most machine-learning approaches. We present a probabilistic framework that uses samples from the distribution of networks that are compatible with the dynamics observed to produce network and uncertainty estimates. We demonstrate the method using the well known independent cascade model to sample from the distribution of networks P(G) conditioned on the observation of a set of infections C. We evaluate the accuracy of the method by using the marginal probabilities of each edge in the distribution, and show the benefits of quantifying uncertainty to improve estimates and understanding, particularly with small amounts of data.
Submitted 9 August, 2019;
originally announced August 2019.
-
How the Avengers assemble: Ecological modelling of effective cast sizes for movies
Authors:
Matthew Roughan,
Lewis Mitchell,
Tobin South
Abstract:
The number of characters in a movie is an interesting feature. However, it is non-trivial to measure directly. Naive metrics such as the number of credited characters vary wildly. Here, we show that a metric based on the notion of "ecological diversity" as expressed through a Shannon-entropy based metric can characterise the number of characters in a movie, and is useful in taxonomic classification. We also show how the metric can be generalised using Jensen-Shannon divergence to provide a measure of the similarity of characters appearing in different movies, for instance of use in recommender systems, e.g., Netflix. We apply our measures to the Marvel Cinematic Universe (MCU), and show what they teach us about this highly successful franchise of movies. In particular, these measures provide a useful predictor of "success" for films in the MCU, as well as a natural means to understand the relationships between the stories in the overall film arc.
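As a rough illustration of the kind of entropy-based diversity measure this abstract describes, the sketch below computes an effective number of characters as the exponential of the Shannon entropy of per-character shares. The input format (a mapping from character name to a weight such as lines of dialogue) and the function name `effective_cast_size` are assumptions made for this example, not details taken from the paper.

```python
import math


def effective_cast_size(counts):
    """Return exp(H), where H is the Shannon entropy (natural log) of the
    normalised weights in `counts`.

    `counts` maps a character name to a non-negative weight, e.g. lines of
    dialogue (a hypothetical choice of weight for this sketch).
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one positive weight")
    entropy = 0.0
    for weight in counts.values():
        if weight > 0:  # zero-weight characters contribute nothing to H
            p = weight / total
            entropy -= p * math.log(p)
    return math.exp(entropy)
```

Under this definition, a cast whose members all have equal weight has an effective size equal to the head count, while a cast dominated by one character has an effective size close to 1.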
Submitted 19 June, 2019;
originally announced June 2019.
-
ForestFirewalls: Getting Firewall Configuration Right in Critical Networks (Technical Report)
Authors:
Dinesha Ranathunga,
Matthew Roughan,
Paul Tune,
Phil Kernick,
Nick Falkner
Abstract:
Firewall configuration is critical, yet often conducted manually with inevitable errors, leaving networks vulnerable to cyber attack [40]. The impact of misconfigured firewalls can be catastrophic in Supervisory Control and Data Acquisition (SCADA) networks. These networks control the distributed assets of industrial systems such as power generation and water distribution systems. Automation can make designing firewall configurations less tedious and their deployment more reliable. In this paper, we propose ForestFirewalls, a high-level approach to configuring SCADA firewalls. Our goals are three-fold. We aim to: first, decouple implementation details from security policy design by abstracting the former; second, simplify policy design; and third, provide automated checks, pre- and post-deployment, to guarantee configuration accuracy. We achieve these goals by automating the implementation of a policy to a network and by auto-validating each stage of the configuration process. We test our approach on a real SCADA network to demonstrate its effectiveness.
Submitted 15 February, 2019;
originally announced February 2019.
-
Verifying and Monitoring IoTs Network Behavior using MUD Profiles
Authors:
Ayyoob Hamza,
Dinesha Ranathunga,
Hassan Habibi Gharakheili,
Theophilus A. Benson,
Matthew Roughan,
Vijay Sivaraman
Abstract:
IoT devices are increasingly being implicated in cyber-attacks, raising community concern about the risks they pose to critical infrastructure, corporations, and citizens. In order to reduce this risk, the IETF is pushing IoT vendors to develop formal specifications of the intended purpose of their IoT devices, in the form of a Manufacturer Usage Description (MUD), so that their network behavior in any operating environment can be locked down and verified rigorously. This paper aims to assist IoT manufacturers in developing and verifying MUD profiles, while also helping adopters of these devices to ensure they are compatible with their organizational policies and to track each device's network behavior based on its MUD profile. Our first contribution is to develop a tool that takes the traffic trace of an arbitrary IoT device as input and automatically generates the MUD profile for it. We contribute our tool as open source, apply it to 28 consumer IoT devices, and highlight insights and challenges encountered in the process. Our second contribution is to apply a formal semantic framework that not only validates a given MUD profile for consistency, but also checks its compatibility with a given organizational policy. We apply our framework to representative organizations and selected devices, to demonstrate how MUD can reduce the effort needed for IoT acceptance testing. Finally, we show how operators can dynamically identify IoT devices using known MUD profiles and monitor their behavioral changes on their network.
Submitted 7 February, 2019;
originally announced February 2019.
-
The one comparing narrative social network extraction techniques
Authors:
Michelle Edwards,
Lewis Mitchell,
Jonathan Tuke,
Matthew Roughan
Abstract:
Analysing narratives through their social networks is an expanding field in quantitative literary studies. Manually extracting a social network from any narrative can be time-consuming, so automatic extraction methods of varying complexity have been developed. However, the effect of different extraction methods on the analysis is unknown. Here we model and compare three extraction methods for social networks in narratives: manual extraction, co-occurrence automated extraction and automated extraction using machine learning. Although the manual extraction method produces more precise results in the network analysis, it is much more time-consuming, and the automatic extraction methods yield comparable conclusions for density, centrality measures and edge weights. Our results provide evidence that social networks extracted automatically are reliable for many analyses. We also describe which aspects of analysis are not reliable with such a social network. We anticipate that our findings will make it easier to analyse more narratives, which will help us improve our understanding of how stories are written and evolve, and how people interact with each other.
Submitted 4 November, 2018;
originally announced November 2018.
-
Generating Connected Random Graphs
Authors:
Caitlin Gray,
Lewis Mitchell,
Matthew Roughan
Abstract:
Sampling random graphs is essential in many applications, and often algorithms use Markov chain Monte Carlo methods to sample uniformly from the space of graphs. However, there is often a need to sample graphs with some property for which standard approaches either cannot sample or are too inefficient. In this paper, we are interested in sampling graphs from a conditional ensemble of the underlying graph model. We present an algorithm to generate samples from an ensemble of connected random graphs using a Metropolis-Hastings framework. The algorithm extends to a general framework for sampling from a known distribution of graphs, conditioned on a desired property. We demonstrate the method to generate connected spatially embedded random graphs, specifically the well-known Waxman network, and illustrate the convergence and practicalities of the algorithm.
Submitted 25 October, 2018; v1 submitted 29 June, 2018;
originally announced June 2018.
-
Clear as MUD: Generating, Validating and Applying IoT Behaviorial Profiles (Technical Report)
Authors:
Ayyoob Hamza,
Dinesha Ranathunga,
H. Habibi Gharakheili,
Matthew Roughan,
Vijay Sivaraman
Abstract:
IoT devices are increasingly being implicated in cyber-attacks, driving community concern about the risks they pose to critical infrastructure, corporations, and citizens. In order to reduce this risk, the IETF is pushing IoT vendors to develop formal specifications of the intended purpose of their IoT devices, in the form of a Manufacturer Usage Description (MUD), so that their network behavior in any operating environment can be locked down and verified rigorously.
This paper aims to assist IoT manufacturers in developing and verifying MUD profiles, while also helping adopters of these devices to ensure they are compatible with their organizational policies. Our first contribution is to develop a tool that takes the traffic trace of an arbitrary IoT device as input and automatically generates a MUD profile for it. We contribute our tool as open source, apply it to 28 consumer IoT devices, and highlight insights and challenges encountered in the process. Our second contribution is to apply a formal semantic framework that not only validates a given MUD profile for consistency, but also checks its compatibility with a given organizational policy. Finally, we apply our framework to representative organizations and selected devices, to demonstrate how MUD can reduce the effort needed for IoT acceptance testing.
Submitted 12 April, 2018;
originally announced April 2018.
-
Super-blockers and the effect of network structure on information cascades
Authors:
Caitlin Gray,
Lewis Mitchell,
Matthew Roughan
Abstract:
Modelling information cascades over online social networks is important in fields from marketing to civil unrest prediction; however, the underlying network structure strongly affects the probability and nature of such cascades. Even with simple cascade dynamics, the probability of large cascades is almost entirely dictated by network properties, with well-known networks such as Erdős-Rényi and Barabási-Albert producing wildly different cascades from the same model. Indeed, the notion of 'super-spreaders' has arisen to describe highly influential nodes promoting global cascades in a social network. Here we use a simple model of global cascades to show that the presence of locality in the network increases the probability of a global cascade due to the increased vulnerability of connecting nodes. Rather than 'super-spreaders', we find that the presence of these highly connected 'super-blockers' in heavy-tailed networks in fact reduces the probability of global cascades, while promoting information spread when targeted as the initial spreader.
Submitted 21 March, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Rigorous statistical analysis of HTTPS reachability
Authors:
George Michaelson,
Matthew Roughan,
Jonathan Tuke,
Matt P. Wand,
Randy Bush
Abstract:
The use of secure connections using HTTPS as the default means, or even the only means, to connect to web servers is increasing. It is being pushed from both sides: from the bottom up by client distributions and plugins, and from the top down by organisations such as Google. However, there are potential technical hurdles that might lock some clients out of the modern web. This paper seeks to measure and precisely quantify those hurdles in the wild. More than three million measurements provide statistically significant evidence of degradation. We show this through a variety of statistical techniques. Various factors are shown to influence the problem, ranging from the client's browser, to the locale from which they connect.
Submitted 8 June, 2017;
originally announced June 2017.
-
The Mathematical Foundations for Mapping Policies to Network Devices (Technical Report)
Authors:
Dinesha Ranathunga,
Matthew Roughan,
Phil Kernick,
Nick Falkner
Abstract:
A common requirement in policy specification languages is the ability to map policies to the underlying network devices. Doing so, in a provably correct way, is important in a security policy context, so administrators can be confident of the level of protection provided by the policies for their networks. Existing policy languages allow policy composition but lack formal semantics to allocate policy to network devices.
Our research tackles this from first principles: we ask how network policies can be described at a high level, independent of firewall-vendor and network minutiae. We identify the algebraic requirements of the policy mapping process and propose semantic foundations to formally verify whether a policy is implemented by the correct set of policy-arbiters. We show the value of our proposed algebras in maintaining concise network-device configurations by applying them to real-world networks.
Submitted 30 May, 2016;
originally announced May 2016.
-
Fast Generation of Spatially Embedded Random Networks
Authors:
Eric Parsonage,
Matthew Roughan
Abstract:
Spatially Embedded Random Networks such as the Waxman random graph have been used in a variety of settings for synthesizing networks. However, little thought has been put into fast generation of these networks. Existing techniques are $O(n^2)$ where $n$ is the number of nodes in the graph. In this paper we present an $O(n + e)$ algorithm, where $e$ is the number of edges.
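For context on the $O(n^2)$ cost this abstract refers to, here is a minimal sketch of the straightforward quadratic approach, which tests every node pair. It is not the $O(n + e)$ algorithm presented in the paper. The parameterisation (edge probability `q * exp(-s * d)` for nodes at Euclidean distance `d`, with points drawn uniformly in the unit square) and all names are assumptions chosen for this example.

```python
import math
import random


def waxman_naive(n, q, s, seed=None):
    """Generate a Waxman-style random graph by checking all n*(n-1)/2 pairs.

    Assumed model for this sketch: nodes are uniform in the unit square and
    an edge (i, j) appears with probability q * exp(-s * distance(i, j)).
    """
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is visited once
            d = math.dist(points[i], points[j])
            if rng.random() < q * math.exp(-s * d):
                edges.append((i, j))
    return points, edges
```

The double loop does work proportional to $n^2$ regardless of how few edges are produced, which is the inefficiency a faster generator would aim to avoid on sparse graphs.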
Submitted 11 December, 2015;
originally announced December 2015.
-
All networks look the same to me: Testing for homogeneity in networks
Authors:
Jonathan Tuke,
Matthew Roughan
Abstract:
How can researchers test for heterogeneity in the local structure of a network? In this paper, we present a framework that utilizes random sampling to give subgraphs which are then used in a goodness of fit test to test for heterogeneity. We illustrate how to use the goodness of fit test for an analytically derived distribution as well as an empirical distribution. To demonstrate our framework, we consider the simple case of testing for edge probability heterogeneity. We examine the significance level, power and computation time for this case with appropriate examples. Finally we outline how to apply our framework to other heterogeneity problems.
Submitted 2 December, 2015;
originally announced December 2015.
-
Unravelling Graph-Exchange File Formats
Authors:
Matthew Roughan,
Jonathan Tuke
Abstract:
A graph is used to represent data in which the relationships between the objects in the data are at least as important as the objects themselves. Over the last two decades nearly a hundred file formats have been proposed or used to provide portable access to such data. This paper seeks to review these formats, and provide some insight to both reduce the ongoing creation of unnecessary formats, and guide the development of new formats where needed.
Submitted 10 March, 2015;
originally announced March 2015.
-
Hidden Markov Model Identifiability via Tensors
Authors:
Paul Tune,
Hung X. Nguyen,
Matthew Roughan
Abstract:
The prevalence of hidden Markov models (HMMs) in various applications of statistical signal processing and communications is a testament to the power and flexibility of the model. In this paper, we link the identifiability problem with tensor decomposition, in particular, the Canonical Polyadic decomposition. Using recent results in deriving uniqueness conditions for tensor decomposition, we are able to provide a necessary and sufficient condition for the identification of the parameters of discrete-time finite-alphabet HMMs. This result resolves a long-standing open problem regarding the derivation of a necessary and sufficient condition for uniquely identifying an HMM. We then further extend recent preliminary work on the identification of HMMs with multiple observers by deriving necessary and sufficient conditions for identifiability in this setting.
Submitted 1 May, 2013;
originally announced May 2013.