-
The cluster structure function
Abstract: For each partition of a data set into a given number of parts there is a partition such that every part is as much as possible a good model (an "algorithmic sufficient statistic") for the data in that part. Since this can be done for every number between one and the number of data, the result is a function, the cluster structure function. It maps the number of parts of a partition to values relate…
Submitted 14 October, 2022; v1 submitted 4 January, 2022; originally announced January 2022.
-
arXiv:1908.10805 [pdf, ps, other]
Logical depth for reversible Turing machines with an application to the rate of decrease in logical depth for general Turing machines
Abstract: The logical depth of a {\em reversible} Turing machine equals the shortest running time of a shortest program for it. This is applied to show that the result in L.F. Antunes, A. Souto, and P.M.B. Vitányi, On the Rate of Decrease in Logical Depth, Theor. Comput. Sci., 702(2017), 60--64 is valid notwithstanding the error noted in Corrigendum P.M.B. Vitányi, Corrigendum to "On the rate of decrease in…
Submitted 28 August, 2019; originally announced August 2019.
Comments: Latex 4 pages
Journal ref: Theor. Comput. Sci., 778(2019), 78-80
-
arXiv:1708.01611 [pdf, ps, other]
Identification of Probabilities
Abstract: Within psychology, neuroscience and artificial intelligence, there has been increasing interest in the proposal that the brain builds probabilistic models of sensory and linguistic input: that is, to infer a probabilistic model from a sample. The practical problems of such inference are substantial: the brain has limited data and restricted computational resources. But there is a more fundamental…
Submitted 4 August, 2017; originally announced August 2017.
Comments: 31 pages LaTeX. arXiv admin note: substantial text overlap with arXiv:1311.7385
Journal ref: Journal of Mathematical Psychology 51, 135-163 (2007)
-
Web Similarity in Sets of Search Terms using Database Queries
Abstract: Normalized web distance (NWD) is a similarity or normalized semantic distance based on the World Wide Web or another large electronic database, for instance Wikipedia, and a search engine that returns reliable aggregate page counts. For sets of search terms the NWD gives a common similarity (common semantics) on a scale from 0 (identical) to 1 (completely different). The NWD approximates the simil…
Submitted 23 July, 2020; v1 submitted 20 February, 2015; originally announced February 2015.
Comments: LaTeX 18 pages, 3 tables. A precursor is arXiv:1308.3177
Journal ref: SN COMPUT. SCI. 1, 161(2020)
-
arXiv:1501.06461 [pdf, ps, other]
On The Average-Case Complexity of Shellsort
Abstract: We prove a lower bound expressed in the increment sequence on the average-case complexity of the number of inversions of Shellsort. This lower bound is sharp in every case where it could be checked. A special case of this lower bound yields the general Jiang-Li-Vitányi lower bound. We obtain new results e.g. determining the average-case complexity precisely in the Yao-Janson-Knuth 3-pass case.
Submitted 8 February, 2017; v1 submitted 26 January, 2015; originally announced January 2015.
Comments: 13 pages LaTeX
Journal ref: Random Structures and Algorithms, 52:2(2018), 354-363
-
arXiv:1410.7328 [pdf, ps, other]
Exact Expression For Information Distance
Abstract: Information distance can be defined not only between two strings but also in a finite multiset of strings of cardinality greater than two. We give an elementary proof for expressing the information distance in terms of plain Kolmogorov complexity. It is exact since for each cardinality of the multiset the lower bound for some multiset equals the upper bound for all multisets up to a constant addit…
Submitted 11 July, 2017; v1 submitted 27 October, 2014; originally announced October 2014.
Comments: 6 pages LaTeX. added material and corrected it
Journal ref: IEEE Trans. Inform. Theory, 63:8(2017), 4725-4728
-
arXiv:1409.4276 [pdf, ps, other]
A Fast Quartet Tree Heuristic for Hierarchical Clustering
Abstract: The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3{n \choose 4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing…
Submitted 12 September, 2014; originally announced September 2014.
Comments: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.DS
Journal ref: Pattern Recognition, 44 (2011) 662-677
-
arXiv:1311.7385 [pdf, ps, other]
Algorithmic Identification of Probabilities
Abstract: The problem is to identify a probability associated with a set of natural numbers, given an infinite data sequence of elements from the set. If the given sequence is drawn i.i.d. and the probability mass function involved (the target) belongs to a computably enumerable (c.e.) or co-computably enumerable (co-c.e.) set of computable probability mass functions, then there is an algorithm to almost s…
Submitted 11 July, 2014; v1 submitted 28 November, 2013; originally announced November 2013.
Comments: 19 pages LaTeX. Corrected errors and rewrote the entire paper. arXiv admin note: text overlap with arXiv:1208.5003
-
arXiv:1310.6976 [pdf, ps, other]
On Logical Depth and the Running Time of Shortest Programs
Abstract: The logical depth with significance $b$ of a finite binary string $x$ is the shortest running time of a binary program for $x$ that can be compressed by at most $b$ bits. There is another definition of logical depth. We give two theorems about the quantitative relation between these versions: the first theorem concerns a variation of a known fact with a new proof, the second theorem and its proof…
Submitted 25 October, 2013; originally announced October 2013.
Comments: 12 pages LaTeX (this supersedes arXiv:1301.4451)
-
Normalized Google Distance of Multisets with Applications
Abstract: Normalized Google distance (NGD) is a relative semantic distance based on the World Wide Web (or any other large electronic database, for instance Wikipedia) and a search engine that returns aggregate page counts. The earlier NGD between pairs of search terms (including phrases) is not sufficient for all applications. We propose an NGD of finite multisets of search terms that is better for many ap…
Submitted 14 August, 2013; originally announced August 2013.
Comments: 25 pages, LaTeX, 3 figures/tables
-
arXiv:1301.4451 [pdf, ps, other]
On the logical depth function
Abstract: For a finite binary string $x$ its logical depth $d$ for significance $b$ is the shortest running time of a program for $x$ of length $K(x)+b$. There is another definition of logical depth. We give a new proof that the two versions are close. There is an infinite sequence of strings of consecutive lengths such that for every string there is a $b$ such that incrementing $b$ by 1 makes the associate…
Submitted 5 July, 2013; v1 submitted 18 January, 2013; originally announced January 2013.
Comments: 11 pages LaTeX; previous version was incorrect, this is a new version with almost the same results
-
Language learning from positive evidence, reconsidered: A simplicity-based approach
Abstract: Children learn their native language by exposure to their linguistic and communicative environment, but apparently without requiring that their mistakes are corrected. Such learning from positive evidence has been viewed as raising logical problems for language acquisition. In particular, without correction, how is the child to recover from conjecturing an over-general grammar, which will be consi…
Submitted 18 January, 2013; originally announced January 2013.
Comments: 39 pages, pdf, 1 figure
Journal ref: A.S. Hsu, N. Chater, P.M.B. Vitanyi, Language learning from positive evidence, reconsidered: A simplicity-based approach. Topics in Cognitive Science, 5:1(2013), 35-55
-
Normalized Compression Distance of Multisets with Applications
Abstract: Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of finite multisets (a.k.a. multiples) of finite objects that is also a metric. Previously, attempts to obtain such an NCD failed. We cover the entire trajectory from…
Submitted 29 March, 2013; v1 submitted 22 December, 2012; originally announced December 2012.
Comments: LaTeX 28 pages, 3 figures. This version is changed from the preliminary version to the final version. Updates of the theory. How to compute it, special recipes for classification, more applications and better results (see abstract and especially the detailed results in the paper). The title was changed to reflect this. In v4 corrected the proof of Theorem III-7
ACM Class: I.5.3; H.3.3; E.4; J.3
Journal ref: IEEE Trans. Pattern Analysis and Machine Intelligence, 37:8(2015), 1602-1614
-
Identification of Probabilities of Languages
Abstract: We consider the problem of inferring the probability distribution associated with a language, given data consisting of an infinite sequence of elements of the language. We do this under two assumptions on the algorithms concerned: (i) like a real-life algorithm it has round-off errors, and (ii) it has no round-off errors. Assuming (i) we (a) consider a probability mass function of the elements of t…
Submitted 15 July, 2014; v1 submitted 24 August, 2012; originally announced August 2012.
Comments: 23 pages LaTeX, no pictures 1311.7385 This paper has been withdrawn by the author due to crucial errors. The same subject is attacked more successfully with reduced claims in arXiv:1311.7385
MSC Class: 68
-
arXiv:1206.0983 [pdf, ps, other]
Conditional Kolmogorov Complexity and Universal Probability
Abstract: The Coding Theorem of L.A. Levin connects unconditional prefix Kolmogorov complexity with the discrete universal distribution. There are conditional versions referred to in several publications but as yet there exist no written proofs in English. Here we provide those proofs. They use a different definition than the standard one for the conditional version of the discrete universal distribution. U…
Submitted 22 January, 2013; v1 submitted 5 June, 2012; originally announced June 2012.
Comments: 17 pages (LaTeX); Corrected previous version. arXiv admin note: text overlap with arXiv:cs/0204037
MSC Class: 68Q30; 03D32
-
arXiv:1201.1223 [pdf, ps, other]
Turing Machines and Understanding Computational Complexity
Abstract: We describe the Turing Machine, list some of its many influences on the theory of computation and complexity of computations, and illustrate its importance.
Submitted 5 January, 2012; originally announced January 2012.
Comments: 9 pages, 1 figure, LaTeX. To appear in: Alan Turing - His Work and Impact, Elsevier
Journal ref: In: S. Barry Cooper, Jan van Leeuwen (eds.), "Alan Turing: His Work and Impact", Elsevier, Amsterdam, London, New York, Tokyo, 2013, pp.57-63
-
arXiv:1201.1221 [pdf, ps, other]
Information Distance: New Developments
Abstract: In pattern recognition, learning, and data mining one obtains information from information-carrying objects. This involves an objective definition of the information in a single object, the information to go from one object to another object in a pair of objects, the information to go from one object to any other object in a multiple of objects, and the shared information between objects. This is…
Submitted 5 January, 2012; originally announced January 2012.
Comments: 4 pages, Latex; Series of Publications C, Report C-2011-45, Department of Computer Science, University of Helsinki, pp. 71-74
Journal ref: Proc. 4th Workshop on Information Theoretic Methods in Science and Engineering (WITSME 2011), 2011, pp. 71-74
-
arXiv:1110.4544 [pdf, ps, other]
Compression-based Similarity
Abstract: First we consider pair-wise distances for literal objects consisting of finite binary files. These files are taken to contain all of their meaning, like genomes or books. The distances are based on compression of the objects concerned, normalized, and can be viewed as similarity distances. Second, we consider pair-wise distances between names of objects, like "red" or "christianity." In this case…
Submitted 20 October, 2011; originally announced October 2011.
Comments: LaTeX, 8 pages, 2 figures, in Proc. IEEE 1st Int. Conf. Data Compression, Communication and Processing, Palinuro, Italy, June 21-24, 2011, 111--118
-
arXiv:1103.5985 [pdf, ps, other]
On Empirical Entropy
Abstract: We propose a compression-based version of the empirical entropy of a finite string over a finite alphabet. Whereas previously one considers the naked entropy of (possibly higher order) Markov processes, we consider the sum of the description of the random variable involved plus the entropy it induces. We assume only that the distribution involved is computable. To test the new notion we compare th…
Submitted 30 March, 2011; originally announced March 2011.
Comments: 14 pages, LaTeX
MSC Class: 68; 94
ACM Class: H.1; F.1; J.1
-
arXiv:1006.3520 [pdf, ps, other]
Information Distance
Abstract: While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two pictures. We give several natural definitions of a universal information metric, based on length of shortest programs for either ordinary computations or reversible (di…
Submitted 17 June, 2010; originally announced June 2010.
Comments: 39 pages, LaTeX, 2 Figures/Tables
MSC Class: 68Q30; 94A15; 94A17
Journal ref: C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W. Zurek, Information Distance, IEEE Trans. Information Theory, 44:4(1998) 1407--1423
-
arXiv:1006.3275 [pdf, ps, other]
Normalized Information Distance is Not Semicomputable
Abstract: Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called 'normalized compression distance' and it is trivially computable. It is a parameter-free similarity measure based on compres…
Submitted 16 June, 2010; originally announced June 2010.
Comments: 9 pages, LaTeX, No figures, To appear in J. Comput. Syst. Sci
MSC Class: 03Dxx; 62B10; 68T10; 91C20
-
The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis
Abstract: There is much debate over the degree to which language learning is governed by innate language-specific biases, or acquired through cognition-general principles. Here we examine the probabilistic language acquisition hypothesis on three levels: We outline a novel theoretical result showing that it is possible to learn the exact generative model underlying a wide class of languages, purely from obs…
Submitted 16 June, 2010; originally announced June 2010.
Comments: 26 pages, pdf, 4 figures, Submitted to "Cognition"
MSC Class: 91E10; 97C30; 68T50
-
arXiv:0910.4353 [pdf, ps, other]
Nonapproximablity of the Normalized Information Distance
Abstract: Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called `normalized compression distance' and it is trivially computable. It is a parameter-free similarity measure based on compr…
Submitted 23 October, 2009; v1 submitted 22 October, 2009; originally announced October 2009.
Comments: LaTeX 8 pages, Submitted. 2nd version corrected some typos
-
arXiv:0906.0731 [pdf, ps]
Distributed elections in an Archimedean ring of processors
Abstract: Unlimited asynchronism is intolerable in real physically distributed computer systems. Such systems, synchronous or not, use clocks and timeouts. Therefore the magnitudes of elapsed absolute time in the system need to satisfy the axiom of Archimedes. Under this restriction of asynchronicity logically time-independent solutions can be derived which are nonetheless better (in number of message pas…
Submitted 27 May, 2009; originally announced June 2009.
Journal ref: 16th ACM Symposium on Theory of Computing, Washington D.C., 1984, 542 - 547
-
arXiv:0905.4452 [pdf, ps, other]
Analysis of Sorting Algorithms by Kolmogorov Complexity (A Survey)
Abstract: Recently, many results on the computational complexity of sorting algorithms were obtained using Kolmogorov complexity (the incompressibility method). Especially, the usually hard average-case analysis is amenable to this method. Here we survey such results about Bubblesort, Heapsort, Shellsort, Dobosiewicz-sort, Shakersort, and sorting with stacks and queues in sequential or parallel mode. Esp…
Submitted 27 May, 2009; originally announced May 2009.
Comments: 18 Pages, 2 figures, LaTeX
Journal ref: Pp. 209--232 in: Entropy, Search, Complexity, Bolyai Society Mathematical Studies, 16, I. Csiszar, G.O.H. Katona, G. Tardos, Eds., Springer-Verlag, 2007
-
arXiv:0905.4039 [pdf, ps, other]
Normalized Web Distance and Word Similarity
Abstract: There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance, which is a general way to tap the amorphous low-grade knowledge available for free on the Intern…
Submitted 25 May, 2009; originally announced May 2009.
Comments: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-1420085921
-
arXiv:0905.3347 [pdf, ps, other]
Information Distance in Multiples
Abstract: Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the the…
Submitted 20 May, 2009; originally announced May 2009.
Comments: LaTeX 14 pages, Submitted to a technical journal
ACM Class: J.3; E.4
-
arXiv:0809.2965 [pdf, ps, other]
On Time-Bounded Incompressibility of Compressible Strings and Sequences
Abstract: For every total recursive time bound $t$, a constant fraction of all compressible (low Kolmogorov complexity) strings is $t$-bounded incompressible (high time-bounded Kolmogorov complexity); there are uncountably many infinite sequences of which every initial segment of length $n$ is compressible to $\log n$ yet $t$-bounded incompressible below ${1/4}n - \log n$; and there are countable infinite…
Submitted 11 August, 2009; v1 submitted 17 September, 2008; originally announced September 2008.
Comments: 9 pages, LaTeX, no figures, submitted to Information Processing Letters. Changed and added a Barzdins-like lemma for infinite sequences with different quantification order, a fixed constant, and uncountably many sequences
-
arXiv:0809.2754 [pdf, ps, other]
Algorithmic information theory
Abstract: We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining `information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to forma…
Submitted 17 September, 2008; v1 submitted 16 September, 2008; originally announced September 2008.
Comments: 37 pages, 2 figures, pdf, in: Philosophy of Information, P. Adriaans and J. van Benthem, Eds., A volume in Handbook of the philosophy of science, D. Gabbay, P. Thagard, and J. Woods, Eds., Elsevier, 2008. In version 1 of September 16 the refs are missing. Corrected in version 2 of September 17
-
Normalized Information Distance
Abstract: The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide…
Submitted 15 September, 2008; originally announced September 2008.
Comments: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appear
-
arXiv:cs/0612133 [pdf, ps, other]
Tales of Huffman
Abstract: We study the new problem of Huffman-like codes subject to individual restrictions on the code-word lengths of a subset of the source words. These are prefix codes with minimal expected code-word length for a random source where additionally the code-word lengths of a subset of the source words is prescribed, possibly differently for every such source word. Based on a structural analysis of prope…
Submitted 25 December, 2006; originally announced December 2006.
Comments: LaTex 8 pages
-
arXiv:cs/0612025 [pdf, ps, other]
Registers
Abstract: Entry in: Encyclopedia of Algorithms, Ming-Yang Kao, Ed., Springer, To appear. Synonyms: Wait-free registers, wait-free shared variables, asynchronous communication hardware. Problem Definition: Consider a system of asynchronous processes that communicate among themselves by only executing read and write operations on a set of shared variables (also known as shared registers). The system has n…
Submitted 5 December, 2006; originally announced December 2006.
Comments: 5 pages, LaTeX, Entry in: Encyclopedia of Algorithms, Ming-Yang Kao, Ed., Springer, To appear
-
arXiv:cs/0606048 [pdf, ps, other]
A New Quartet Tree Heuristic for Hierarchical Clustering
Abstract: We consider the problem of constructing an optimal-weight tree from the 3*(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical…
Submitted 11 June, 2006; originally announced June 2006.
Comments: 22 pages, 14 figures
ACM Class: F.2.2; G.1.6
-
arXiv:cs/0412098 [pdf, ps, other]
The Google Similarity Distance
Abstract: Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of `society' is `database,' and the equivalent of `use' is `way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the w…
Submitted 30 May, 2007; v1 submitted 21 December, 2004; originally announced December 2004.
Comments: 15 pages, 10 figures; changed some text/figures/notation/part of theorem. Incorporated referees' comments. This is the final published version up to some minor changes in the galley proofs
ACM Class: I.2.4; I.2.7
Journal ref: R.L. Cilibrasi, P.M.B. Vitanyi, The Google Similarity Distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370-383
-
arXiv:cs/0411014 [pdf, ps, other]
Rate Distortion and Denoising of Individual Data Using Kolmogorov complexity
Abstract: We examine the structure of families of distortion balls from the perspective of Kolmogorov complexity. Special attention is paid to the canonical rate-distortion function of a source word which returns the minimal Kolmogorov complexity of all distortion balls containing that word subject to a bound on their cardinality. This canonical rate-distortion function is related to the more standard alg…
Submitted 26 November, 2009; v1 submitted 6 November, 2004; originally announced November 2004.
Comments: LaTex, 31 pages, 2 figures. The new version is again completely rewritten, newly titled, and adds new results
ACM Class: E.4; H.1.1
-
arXiv:math/0110086 [pdf, ps, other]
Randomness
Abstract: Here we present in a single essay a combination and completion of the several aspects of the problem of randomness of individual objects which of necessity occur scattered in our textbook "An Introduction to Kolmogorov Complexity and Its Applications" (M. Li and P. Vitanyi), 2nd Ed., Springer-Verlag, 1997.
Submitted 10 October, 2001; v1 submitted 8 October, 2001; originally announced October 2001.
Comments: LaTeX source, 48 pages, Section contributed to `Matematica, Logica, Informatica' Volume 12 of the "Storia del XX Secolo", published by the "Istituto della Enciclopedia Italiana" (small addition in new version)
MSC Class: 60-02; 60A05; 62-02; 62A01
-
Quantum Kolmogorov Complexity Based on Classical Descriptions
Abstract: We develop a theory of the algorithmic information in bits contained in an individual pure quantum state. This extends classical Kolmogorov complexity to the quantum domain retaining classical descriptions. Quantum Kolmogorov complexity coincides with the classical Kolmogorov complexity on the classical domain. Quantum Kolmogorov complexity is upper bounded and can be effectively approximated fr…
Submitted 9 October, 2001; v1 submitted 21 February, 2001; originally announced February 2001.
Comments: 17 pages, LaTeX, final and extended version of quant-ph/9907035, with corrections to the published journal version (the two displayed equations in the right-hand column on page 2466 had the left-hand sides of the displayed formulas erroneously interchanged)
Journal ref: IEEE Transactions on Information Theory, Vol. 47, No. 6, September 2001, 2464-2479