-
Compression and Symmetry of Small-World Graphs and Structures
Authors:
Ioannis Kontoyiannis,
Yi Heng Lim,
Katia Papakonstantinopoulou,
Wojtek Szpankowski
Abstract:
For various purposes and, in particular, in the context of data compression, a graph can be examined at three levels. Its structure can be described as the unlabeled version of the graph; then the labeling of its structure can be added; and finally, given the structure and labeling, the contents of the labels can be described. Determining the amount of information present at each level and quantifying the degree of dependence between them requires the study of symmetry, graph automorphism, entropy, and graph compressibility. In this paper, we focus on a class of small-world graphs. These are geometric random graphs where vertices are first connected to their nearest neighbors on a circle and then pairs of non-neighbors are connected according to a distance-dependent probability distribution. We establish the degree distribution of this model, and use it to prove the model's asymmetry in an appropriate range of parameters. Then we derive the relevant entropy and structural entropy of these random graphs, in connection with graph compression.
Submitted 22 November, 2021; v1 submitted 31 July, 2020;
originally announced July 2020.
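The construction described in this abstract (nearest neighbors on a circle, then distance-dependent long-range edges) can be sketched as follows. This is only an illustration: the abstract does not state the connection probability, so the power-law form `c * d ** (-beta)` and the names `small_world_graph`, `k`, `c`, and `beta` are assumptions made here, not the authors' parametrization.

```python
import random


def circle_distance(u, v, n):
    """Distance between vertices u and v placed on a circle of n points."""
    d = abs(u - v) % n
    return min(d, n - d)


def small_world_graph(n, k, c, beta, rng):
    """Return the edge set of one sample of a small-world-style graph.

    Vertices within circle distance k are always connected. Every other
    pair at distance d is connected with probability min(1, c * d**(-beta)).
    The power-law probability is an assumed example, not taken from the paper.
    """
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            d = circle_distance(u, v, n)
            if d <= k:
                edges.add((u, v))  # nearest-neighbor (lattice) edge
            elif rng.random() < min(1.0, c * d ** (-beta)):
                edges.add((u, v))  # random long-range edge
    return edges


def degree_sequence(n, edges):
    """Degree of every vertex, in vertex order."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

With `c = 0` the sample reduces to the ring lattice, in which every vertex has degree `2 * k`; larger `c` or smaller `beta` adds more long-range edges.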
-
Hidden Words Statistics for Large Patterns
Authors:
Svante Janson,
Wojciech Szpankowski
Abstract:
We study here the so-called subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications, from intrusion detection, to trace reconstruction, to deletion channels, and to DNA-based storage systems. In all of these applications, the pattern $w$ is of variable length. To the best of our knowledge, this problem has only been tackled for a fixed length $m=O(1)$ [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of the same symbol, we indicate that for $m=o(n)$ the distribution of the number of subsequences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.
Submitted 21 March, 2020;
originally announced March 2020.
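The quantity studied in this abstract, the number of occurrences of $w$ as a subsequence of a text, can be computed with a standard dynamic program. The sketch below is a generic illustration; the function name `count_subsequence_occurrences` is chosen here and does not come from the paper.

```python
def count_subsequence_occurrences(text, pattern):
    """Count the ways `pattern` occurs in `text` as a (not necessarily
    contiguous) subsequence, in O(len(text) * len(pattern)) time."""
    m = len(pattern)
    # ways[j] = number of ways to match the first j pattern symbols
    # using the text symbols read so far.
    ways = [0] * (m + 1)
    ways[0] = 1  # the empty prefix is matched in exactly one way
    for ch in text:
        # Go right to left so each text symbol extends a match at most once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]
```

For instance, the pattern `ab` occurs three times as a subsequence of `abab` (positions 1-2, 1-4, and 3-4).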
-
Toward Universal Testing of Dynamic Network Models
Authors:
Abram Magner,
Wojciech Szpankowski
Abstract:
Numerous networks in the real world change over time, in the sense that nodes and edges enter and leave the networks. Various dynamic random graph models have been proposed to explain the macroscopic properties of these systems and to provide a foundation for statistical inferences and predictions. It is of interest to have a rigorous way to determine how well these models match observed networks. We thus ask the following goodness-of-fit question: given a sequence of observations/snapshots of a growing random graph, along with a candidate model M, can we determine whether the snapshots came from M or from some arbitrary alternative model that is well-separated from M in some natural metric? We formulate this problem precisely and reduce it to goodness-of-fit testing for graph-valued, infinite-state Markov processes. We then exhibit and analyze a universal test based on non-stationary sampling for a natural class of models.
Submitted 13 February, 2020; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Asymmetric Rényi Problem
Authors:
Michael Drmota,
Abram Magner,
Wojciech Szpankowski
Abstract:
In 1960 Rényi, in his Michigan State University lectures, asked for the number of random queries necessary to recover a hidden bijective labeling of $n$ distinct objects. In each query one selects a random subset of labels and asks, which objects have these labels? We consider here an asymmetric version of the problem in which in every query an object is chosen with probability $p > 1/2$ and we ignore "inconclusive" queries. We study the number of queries needed to recover the labeling in its entirety ($H_n$), before at least one element is recovered ($F_n$), and to recover a randomly chosen element ($D_n$). This problem exhibits several remarkable behaviors: $D_n$ converges in probability but not almost surely, and $H_n$ and $F_n$ exhibit phase transitions with respect to $p$ in the second term. We prove that for $p>1/2$ with high probability (whp) we need $H_n=\log_{1/p} n +\frac{1}{2} \log_{p/(1-p)}\log n +o(\log \log n)$ queries to recover the entire bijection. This should be compared to its symmetric ($p=1/2$) counterpart established by Pittel and Rubin, who proved that in this case one requires $H_n=\log_{2} n +\sqrt{2 \log_{2} n} +o(\sqrt{\log n})$ queries. As a bonus, our analysis implies novel results for random PATRICIA tries, as the problem is probabilistically equivalent to that of the height, fillup level, and typical depth of a PATRICIA trie built from $n$ independent binary sequences generated by a biased($p$) memoryless source.
Submitted 5 November, 2017;
originally announced November 2017.
-
Asymmetry and structural information in preferential attachment graphs
Authors:
Tomasz Luczak,
Abram Magner,
Wojciech Szpankowski
Abstract:
Graph symmetries intervene in diverse applications, from enumeration, to graph structure compression, to the discovery of graph dynamics (e.g., node arrival order inference). Whereas Erdős-Rényi graphs are typically asymmetric, real networks are highly symmetric. So a natural question is whether preferential attachment graphs, where in each step a new node with $m$ edges is added, exhibit any symmetry. In recent work it was proved that preferential attachment graphs are symmetric for $m=1$, and there is some non-negligible probability of symmetry for $m=2$. It was conjectured that these graphs are asymmetric when $m \geq 3$. We settle this conjecture in the affirmative, then use it to estimate the structural entropy of the model. To do this, we also give bounds on the number of ways that the given graph structure could have arisen by preferential attachment. These results have further implications for information theoretic problems of interest on preferential attachment graphs.
Submitted 23 December, 2018; v1 submitted 14 July, 2016;
originally announced July 2016.
-
Asymmetric Rényi Problem and PATRICIA Tries
Authors:
Michael Drmota,
Abram Magner,
Wojciech Szpankowski
Abstract:
In 1960, Rényi asked for the number of random queries necessary to recover a hidden bijective labeling of n distinct objects. In each query one selects a random subset of labels and asks, what is the set of objects that have these labels? We consider here an asymmetric version of the problem in which in every query an object is chosen with probability p > 1/2 and we ignore "inconclusive" queries. We study the number of queries needed to recover the labeling in its entirety (the height), to recover at least one single element (the fillup level), and to recover a randomly chosen element (the typical depth). This problem exhibits several remarkable behaviors: the depth D_n converges in probability but not almost surely, and while it satisfies the central limit theorem, its local limit theorem does not hold; the height H_n and the fillup level F_n exhibit phase transitions with respect to p in the second term. To obtain these results, we take a unified approach via the analysis of the external profile, defined at level k as the number of elements recovered by the kth query. We first establish new precise asymptotic results for the average and variance, and a central limit law, for the external profile in the regime where it grows polynomially with n. We then extend the external profile results to the boundaries of the central region, leading to the solution of our problem for the height and fillup level. As a bonus, our analysis implies novel results for analogous parameters of random PATRICIA tries.
Submitted 6 May, 2016;
originally announced May 2016.
-
A Limit Theorem for Radix Sort and Tries with Markovian Input
Authors:
Kevin Leckey,
Ralph Neininger,
Wojciech Szpankowski
Abstract:
Tries are among the most versatile and widely used data structures on words. In particular, they are used in fundamental sorting algorithms such as radix sort, which we study in this paper. While the performance of radix sort and tries under a realistic probabilistic model for the generation of words is of significant importance, its analysis, even for the simplest memoryless sources, has proved difficult. In this paper we consider a more realistic model where words are generated by a Markov source. By a novel use of the contraction method combined with moment transfer techniques we prove a central limit theorem for the complexity of radix sort and for the external path length in a trie. This is the first application of the contraction method to the analysis of algorithms and data structures with Markovian inputs; it relies on the use of systems of stochastic recurrences combined with a product version of the Zolotarev metric.
Submitted 27 May, 2015;
originally announced May 2015.
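As a rough illustration of the central quantity in this abstract, the sketch below computes the external path length (the sum of the leaf depths) of a binary trie and generates words from a simple two-state Markov source. The names `markov_bits`, `external_path_length`, and `p_stay` are hypothetical, and the symmetric two-state chain is only an assumed example of a Markov source, not the general model of the paper.

```python
import random


def markov_bits(length, p_stay, rng):
    """Generate `length` bits from a symmetric two-state Markov chain.

    The first bit is uniform; each later bit repeats the previous one
    with probability p_stay (an assumed, illustrative parametrization).
    """
    bits = [rng.randrange(2)]
    for _ in range(length - 1):
        prev = bits[-1]
        bits.append(prev if rng.random() < p_stay else 1 - prev)
    return tuple(bits)


def external_path_length(words, depth=0):
    """Sum of leaf depths in the binary trie built from `words`.

    `words` must be distinct bit tuples of equal length, so that any
    group of two or more words can always be split by a later bit.
    """
    if len(words) <= 1:
        # An empty subtree contributes nothing; a single word is a leaf.
        return depth * len(words)
    zeros = [w for w in words if w[depth] == 0]
    ones = [w for w in words if w[depth] == 1]
    return (external_path_length(zeros, depth + 1)
            + external_path_length(ones, depth + 1))
```

For the words `00`, `01`, and `10`, the leaf for `10` sits at depth 1 and the other two leaves at depth 2, so the external path length is 5.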
-
Towards More Realistic Probabilistic Models for Data Structures: The External Path Length in Tries under the Markov Model
Authors:
Kevin Leckey,
Ralph Neininger,
Wojciech Szpankowski
Abstract:
Tries are among the most versatile and widely used data structures on words. They are pertinent to the (internal) structure of (stored) words and several splitting procedures used in diverse contexts ranging from document taxonomy to IP address lookup, from data compression (e.g., the Lempel-Ziv'77 scheme) to dynamic hashing, from partial-match queries to speech recognition, from leader election algorithms to distributed hash tables and graph compression. While the performance of tries under a realistic probabilistic model is of significant importance, its analysis, even for the simplest memoryless sources, has proved difficult. Inherently complex parameters have rarely been analyzed rigorously (with a few notable exceptions) under more realistic models of string generation. In this paper we meet these challenges: By a novel use of the contraction method combined with analytic techniques we prove a central limit theorem for the external path length of a trie under a general Markov source. In particular, our results apply to the Lempel-Ziv'77 code. We envision that the methods described here will have further applications to other trie parameters and data structures.
Submitted 18 September, 2012; v1 submitted 2 July, 2012;
originally announced July 2012.
-
Partial fillup and search time in LC tries
Authors:
Svante Janson,
Wojciech Szpankowski
Abstract:
Andersson and Nilsson introduced in 1993 a level-compressed trie (in short: LC trie) in which a full subtree of a node is compressed to a single node whose degree is the size of the subtree. Recent experimental results indicated a 'dramatic improvement' when full subtrees are replaced by partially filled subtrees. In this paper, we provide a theoretical justification of these experimental results, showing, among other things, a rather moderate improvement of the search time over the original LC tries. For such an analysis, we assume that n strings are generated independently by a binary memoryless source with p denoting the probability of emitting a 1. We first prove that the so-called alpha-fillup level (i.e., the largest level in a trie with alpha fraction of nodes present at this level) is concentrated on two values with high probability. We give these values explicitly up to O(1), and observe that the value of alpha (strictly between 0 and 1) does not affect the leading term.
This result directly yields the typical depth (search time) in the alpha-LC tries with p not equal to 1/2, which turns out to be C loglog n for an explicitly given constant C (depending on p but not on alpha). This should be compared with the recently found typical depth in the original LC tries, which is C' loglog n for a larger constant C'. The search time in alpha-LC tries is thus smaller but of the same order as in the original LC tries.
Submitted 6 October, 2005;
originally announced October 2005.
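The alpha-fillup level mentioned in this abstract can be illustrated with a small sketch. It reads "alpha fraction of nodes present" as "at least alpha * 2^k of the 2^k possible nodes exist at level k", which is an interpretation made here; the function names and the `max_level` cutoff are likewise assumptions for illustration.

```python
from collections import Counter


def nodes_per_level(words, max_level):
    """Number of trie nodes at levels 0..max_level.

    `words` are binary strings of length at least max_level. A node for
    a length-k prefix exists when some word has that prefix and the
    parent prefix is shared by at least two words (otherwise the parent
    is already a leaf and the trie stops there).
    """
    counts = [1]  # level 0 holds only the root
    for k in range(1, max_level + 1):
        parent_sizes = Counter(w[:k - 1] for w in words)
        present = {w[:k] for w in words if parent_sizes[w[:k - 1]] >= 2}
        counts.append(len(present))
    return counts


def alpha_fillup_level(words, alpha, max_level):
    """Largest level k <= max_level holding at least alpha * 2**k nodes."""
    best = 0
    for k, count in enumerate(nodes_per_level(words, max_level)):
        if count >= alpha * 2 ** k:
            best = k
    return best
```

With `alpha = 1` this reduces to the ordinary fillup level, the deepest level at which the trie is still complete.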