Search | arXiv e-print repository

On de Bruijn Array Codes Part II: Linear Codes

Authors: Simon Blackburn, Yeow Meng Chee, Tuvi Etzion, Huimin Lao

Abstract: An M-sequence generated by a primitive polynomial has many interesting and desirable properties. A pseudo-random array is the two-dimensional generalization of an M-sequence. Similarly to primitive polynomials, there are irreducible and reducible polynomials all of whose nonzero sequences have the same length. In this paper, a two-dimensional generalization for such sequences is given. This generalization is for a pseudo-random array code which is a set of $r_1 \times r_2$ arrays in which each $n_1 \times n_2$ nonzero matrix is contained exactly once as a window in one of the arrays. Moreover, these arrays have the shift-and-add property, i.e., the bitwise addition of two arrays (or a nontrivial shift of such arrays) is another array (or a shift of another array) from the code. All the known arrays can be formed by folding sequences generated from an irreducible polynomial or a reducible polynomial whose factors have the same degree and the same exponent. Two proof techniques are used to prove the parameters of the constructed arrays. The first one is based on another method, different from folding, for constructing some of these arrays. The second one is a generalization of a known proof technique. This generalization makes it possible to present pseudo-random arrays with parameters not known before and also a variety of pseudo-random array codes which cannot be generated by the first method. The two techniques also suggest two different hierarchies between pseudo-random array codes. Finally, two methods to verify whether a folding of sequences, generated by these polynomials, yields a pseudo-random array or a pseudo-random array code, will be presented.

Submitted 18 June, 2025; v1 submitted 21 January, 2025; originally announced January 2025.
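
The window property and the folding step mentioned in the abstract can be illustrated on the smallest classical case. The sketch below is not taken from the paper: it assumes the primitive polynomial $x^4 + x + 1$ over GF(2), the standard diagonal folding of a length-15 sequence into a $3 \times 5$ array, and cyclic windows. The helper names `lfsr_sequence`, `fold`, and `window_counts` are hypothetical.

```python
from collections import Counter


def lfsr_sequence(taps, state, length):
    """Binary linear recurrence: s[t + n] = XOR of s[t + i] for i in taps."""
    seq = list(state)
    n = len(state)
    while len(seq) < length:
        t = len(seq) - n
        bit = 0
        for i in taps:
            bit ^= seq[t + i]
        seq.append(bit)
    return seq[:length]


def fold(seq, r1, r2):
    """Fold a sequence of length r1 * r2 (gcd(r1, r2) = 1) along the diagonal."""
    array = [[0] * r2 for _ in range(r1)]
    for i, bit in enumerate(seq):
        array[i % r1][i % r2] = bit
    return array


def window_counts(array, n1, n2):
    """Count every cyclic n1 x n2 window of the array, read row by row."""
    r1, r2 = len(array), len(array[0])
    return Counter(
        tuple(array[(r + a) % r1][(c + b) % r2]
              for a in range(n1) for b in range(n2))
        for r in range(r1) for c in range(r2)
    )


# x^4 + x + 1 gives the recurrence s[t + 4] = s[t + 1] XOR s[t].
seq = lfsr_sequence(taps=(0, 1), state=(0, 0, 0, 1), length=15)

# One-dimensional case: each nonzero 4-bit window occurs exactly once per period.
seq_windows = window_counts([seq], 1, 4)

# Two-dimensional case: fold into a 3 x 5 array and inspect the 2 x 2 windows.
array = fold(seq, 3, 5)
array_windows = window_counts(array, 2, 2)

print(len(seq_windows), len(array_windows))
```

In this small instance each of the 15 nonzero windows appears exactly once in both the sequence and the folded array, and the all-zero window never appears. The array codes in the paper generalize this to sets of arrays.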

arXiv:2405.01616 [pdf, other]

Generative Active Learning for the Search of Small-molecule Protein Binders

Authors: Maksym Korablyov, Cheng-Hao Liu, Moksh Jain, Almer M. van der Sloot, Eric Jolicoeur, Edward Ruediger, Andrei Cristian Nica, Emmanuel Bengio, Kostiantyn Lapchevskyi, Daniel St-Cyr, Doris Alexandra Schuetz, Victor Ion Butoi, Jarrid Rector-Brooks, Simon Blackburn, Leo Feng, Hadi Nekoei, SaiKrishna Gottipati, Priyesh Vijayan, Prateek Gupta, Ladislav Rampášek, Sasikanth Avancha, Pierre-Luc Bacon, William L. Hamilton, Brooks Paige, Sanchit Misra, et al. (9 additional authors not shown)

Abstract: Despite substantial progress in machine learning for scientific discovery in recent years, truly de novo design of small molecules which exhibit a property of interest remains a significant challenge. We introduce LambdaZero, a generative active learning approach to search for synthesizable molecules. Powered by deep reinforcement learning, LambdaZero learns to search over the vast space of molecules to discover candidates with a desired property. We apply LambdaZero with molecular docking to design novel small molecules that inhibit the enzyme soluble Epoxide Hydrolase 2 (sEH), while enforcing constraints on synthesizability and drug-likeness. LambdaZero provides an exponential speedup in terms of the number of calls to the expensive molecular docking oracle, and LambdaZero de novo designed molecules reach docking scores that would otherwise require the virtual screening of a hundred billion molecules. Importantly, LambdaZero discovers novel scaffolds of synthesizable, drug-like inhibitors for sEH. In in vitro experimental validation, a series of ligands from a generated quinazoline-based scaffold were synthesized, and the lead inhibitor N-(4,6-di(pyrrolidin-1-yl)quinazolin-2-yl)-N-methylbenzamide (UM0152893) displayed sub-micromolar enzyme inhibition of sEH.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2211.10309 [pdf, ps, other]

Constructions and bounds for codes with restricted overlaps

Authors: Simon R. Blackburn, Navid Nasr Esfahani, Donald L. Kreher, Douglas R. Stinson

Abstract: Non-overlapping codes have been studied for almost 60 years. In such a code, no proper, non-empty prefix of any codeword is a suffix of any codeword. In this paper, we study codes in which overlaps of certain specified sizes are forbidden. We prove some general bounds and we give several constructions in the case of binary codes. Our techniques also allow us to provide an alternative, elementary proof of a lower bound on non-overlapping codes due to Levenshtein in 1964.

Submitted 22 August, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

Comments: 17 pages. Theorems etc renumbered

MSC Class: 94A45
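
The definition in the abstract is easy to check by brute force. The following is a minimal sketch, assuming all codewords are strings of the same length; the function names and the example codewords are illustrative and do not come from the paper.

```python
def overlap_sizes(code):
    """Return every k (0 < k < n) for which the length-k prefix of some
    codeword equals the length-k suffix of some (possibly the same) codeword."""
    n = len(code[0])
    assert all(len(word) == n for word in code), "fixed-length code assumed"
    return {
        k
        for u in code
        for v in code
        for k in range(1, n)
        if u[:k] == v[n - k:]
    }


def avoids_overlaps(code, forbidden_sizes):
    """True if no overlap of a forbidden size occurs in the code."""
    return overlap_sizes(code).isdisjoint(forbidden_sizes)


small_code = ["11000", "11010"]
larger_code = ["11000", "11010", "11100"]

print(overlap_sizes(small_code))             # empty set: non-overlapping
print(overlap_sizes(larger_code))            # only an overlap of size 4
print(avoids_overlaps(larger_code, {1, 2, 3}))
```

The two-word code is non-overlapping in the classical sense. Adding `11100` introduces a single overlap of size 4 (the prefix `1100` of `11000` is a suffix of `11100`), so the larger code is still admissible when only overlaps of sizes 1 to 3 are forbidden.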

arXiv:2210.17175 [pdf, other]

doi 10.1145/3519939.3523440

Low-Latency, High-Throughput Garbage Collection (Extended Version)

Authors: Wenyu Zhao, Stephen M. Blackburn, Kathryn S. McKinley

Abstract: Production garbage collectors make substantial compromises in pursuit of reduced pause times. They require far more CPU cycles and memory than prior simpler collectors. Concurrent copying collectors (C4, ZGC, and Shenandoah) suffer from the following design limitations. 1) Concurrent copying. They only reclaim memory by copying, which is inherently expensive with high memory bandwidth demands. Concurrent copying also requires expensive read and write barriers. 2) Scalability. They depend on tracing, which in the limit and in practice does not scale. 3) Immediacy. They do not reclaim older objects promptly, incurring high memory overheads. We present LXR, which takes a very different approach to optimizing responsiveness and throughput by minimizing concurrent collection work and overheads. 1) LXR reclaims most memory without any copying by using the Immix heap structure. It then combats fragmentation with limited judicious stop-the-world copying. 2) LXR uses reference counting to achieve both scalability and immediacy, promptly reclaiming young and old objects. It uses concurrent tracing as needed for identifying cyclic garbage. 3) To minimize pause times while allowing judicious copying of mature objects, LXR introduces remembered sets for reference counting and concurrent decrement processing. 4) LXR introduces a novel low-overhead write barrier that combines coalescing reference counting, concurrent tracing, and remembered set maintenance. The result is a collector with excellent responsiveness and throughput. On the widely-used Lucene search engine with a generously sized heap, LXR has 6x higher throughput while delivering 30x lower 99.9 percentile tail latency than the popular Shenandoah production collector in its default configuration.

Submitted 31 October, 2022; originally announced October 2022.

Comments: 17 pages, 7 Figures. This extends the original publication with an LBO analysis (Section 5.5)

ACM Class: D.3.4

Journal ref: p76-91,PLDI '22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA, USA, June 13 - 17, 2022

arXiv:2210.14100 [pdf, ps, other]

The capacity of a finite field matrix channel

Authors: Simon R. Blackburn, Jessica Claridge

Abstract: The Additive-Multiplicative Matrix Channel (AMMC) was introduced by Silva, Kschischang and Kötter in 2010 to model data transmission using random linear network coding. The input and output of the channel are $n\times m$ matrices over a finite field $\mathbb{F}_q$. On input the matrix $X$, the channel outputs $Y=A(X+W)$ where $A$ is a uniformly chosen $n\times n$ invertible matrix over $\mathbb{F}_q$ and where $W$ is a uniformly chosen $n\times m$ matrix over $\mathbb{F}_q$ of rank $t$. Silva et al. considered the case when $2n\leq m$. They determined the asymptotic capacity of the AMMC when $t$, $n$ and $m$ are fixed and $q\rightarrow\infty$. They also determined the leading term of the capacity when $q$ is fixed, and $t$, $n$ and $m$ grow linearly. We generalise these results, showing that the condition $2n\leq m$ can be removed. (Our formula for the capacity falls into two cases, one of which generalises the $2n\leq m$ case.) We also improve the error term in the case when $q$ is fixed.

Submitted 20 January, 2025; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: 31 pages, 1 figure. Minor changes for clarity

MSC Class: 94A40
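
The channel law $Y=A(X+W)$ can be simulated directly for small parameters. This is only a sketch under stated assumptions: $q$ is taken to be a prime $p$ so that arithmetic is plain modular arithmetic, and both $A$ and $W$ are drawn by rejection sampling, which is uniform but slow. All function names are hypothetical.

```python
import random


def rank_mod_p(matrix, p):
    """Rank of a matrix over the prime field F_p, by Gauss-Jordan elimination."""
    rows = [[x % p for x in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(x - factor * y) % p
                           for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def random_matrix_of_rank(nrows, ncols, rank, p, rng):
    """Uniform matrix of the given rank, by rejection sampling."""
    while True:
        m = [[rng.randrange(p) for _ in range(ncols)] for _ in range(nrows)]
        if rank_mod_p(m, p) == rank:
            return m


def ammc(x, t, p, rng):
    """One use of the channel: Y = A (X + W) over F_p."""
    n, m = len(x), len(x[0])
    a = random_matrix_of_rank(n, n, n, p, rng)   # invertible n x n
    w = random_matrix_of_rank(n, m, t, p, rng)   # rank-t error matrix
    s = [[(x[i][j] + w[i][j]) % p for j in range(m)] for i in range(n)]
    return [[sum(a[i][k] * s[k][j] for k in range(n)) % p
             for j in range(m)] for i in range(n)]


rng = random.Random(0)
zero_input = [[0] * 5 for _ in range(3)]
y = ammc(zero_input, t=2, p=2, rng=rng)
print(rank_mod_p(y, 2))
```

Because $A$ is invertible, $Y$ has the same rank as $X+W$, so transmitting the zero matrix always yields an output of rank exactly $t$.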

arXiv:2112.07880 [pdf, other]

doi 10.1109/ISPASS55109.2022.00005

Distilling the Real Cost of Production Garbage Collectors

Authors: Zixian Cai, Stephen M. Blackburn, Michael D. Bond, Martin Maas

Abstract: Abridged abstract: despite the long history of garbage collection (GC) and its prevalence in modern programming languages, there is surprisingly little clarity about its true cost. Without understanding their cost, crucial tradeoffs made by garbage collectors (GCs) go unnoticed. This can lead to misguided design constraints and evaluation criteria used by GC researchers and users, hindering the development of high-performance, low-cost GCs. In this paper, we develop a methodology that allows us to empirically estimate the cost of GC for any given set of metrics. By distilling out the explicitly identifiable GC cost, we estimate the intrinsic application execution cost using different GCs. The minimum distilled cost forms a baseline. Subtracting this baseline from the total execution costs, we can then place an empirical lower bound on the absolute costs of different GCs. Using this methodology, we study five production GCs in OpenJDK 17, a high-performance Java runtime. We measure the cost of these collectors, and expose their respective key performance tradeoffs. We find that with a modestly sized heap, production GCs incur substantial overheads across a diverse suite of modern benchmarks, spending at least 7-82% more wall-clock time and 6-92% more CPU cycles relative to the baseline cost. We show that these costs can be masked by concurrency and generous provisioning of memory/compute. In addition, we find that newer low-pause GCs are significantly more expensive than older GCs, and, surprisingly, sometimes deliver worse application latency than stop-the-world GCs. Our findings reaffirm that GC is by no means a solved problem and that a low-cost, low-latency GC remains elusive. We recommend adopting the distillation methodology together with a wider range of cost metrics for future GC evaluations.

Submitted 5 May, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: Camera-ready version
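
The arithmetic behind the distillation methodology, as the abstract describes it, can be sketched in a few lines. The collector names and all numbers below are made up purely for illustration and are not measurements from the paper; the function `distill` is likewise hypothetical.

```python
def distill(total_cost, identifiable_gc_cost):
    """Apply the three steps named in the abstract to one metric.

    1. Distilled cost = total cost minus the explicitly identifiable GC cost.
    2. Baseline = the minimum distilled cost over all collectors.
    3. Lower bound on each collector's cost = total cost minus the baseline.
    """
    distilled = {gc: total_cost[gc] - identifiable_gc_cost[gc]
                 for gc in total_cost}
    baseline = min(distilled.values())
    lower_bound = {gc: total_cost[gc] - baseline for gc in total_cost}
    relative = {gc: lower_bound[gc] / baseline for gc in total_cost}
    return baseline, lower_bound, relative


# Hypothetical wall-clock seconds for one benchmark under three collectors.
total = {"collector_a": 10.0, "collector_b": 11.0, "collector_c": 13.0}
identifiable = {"collector_a": 1.5, "collector_b": 1.0, "collector_c": 2.0}

baseline, lower_bound, relative = distill(total, identifiable)
for gc in total:
    print(f"{gc}: at least {lower_bound[gc]:.1f}s, "
          f"{relative[gc]:.0%} over the baseline")
```

The result is a lower bound rather than an exact cost because the baseline may itself still contain GC cost that could not be explicitly identified.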

arXiv:2004.12485 [pdf, other]

Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning

Authors: Sai Krishna Gottipati, Boris Sattarov, Sufeng Niu, Yashaswi Pathak, Haoran Wei, Shengchao Liu, Karam M. J. Thomas, Simon Blackburn, Connor W. Coley, Jian Tang, Sarath Chandar, Yoshua Bengio

Abstract: Over the last decade, there has been significant progress in the field of machine learning for de novo drug design, particularly in deep generative models. However, current generative approaches face a significant challenge as they do not ensure that the proposed molecular structures can be feasibly synthesized nor do they provide the synthesis routes of the proposed small molecules, thereby seriously limiting their practical applicability. In this work, we propose a novel forward synthesis framework powered by reinforcement learning (RL) for de novo drug design, Policy Gradient for Forward Synthesis (PGFS), that addresses this challenge by embedding the concept of synthetic accessibility directly into the de novo drug design system. In this setup, the agent learns to navigate through the immense synthetically accessible chemical space by subjecting commercially available small molecule building blocks to valid chemical reactions at every time step of the iterative virtual multi-step synthesis process. The proposed environment for drug discovery provides a highly challenging test-bed for RL algorithms owing to the large state space and high-dimensional continuous action space with hierarchical actions. PGFS achieves state-of-the-art performance in generating structures with high QED and penalized clogP. Moreover, we validate PGFS in an in-silico proof-of-concept associated with three HIV targets. Finally, we describe how the end-to-end training conceptualized in this study represents an important paradigm in radically expanding the synthesizable chemical space and automating the drug discovery process.

Submitted 19 May, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: added the statistics of top-100 compounds; used logP metric with scaled components; added values of the initial reactants to the box plots; some values in tables are recalculated due to the inconsistent environments on different machines; corresponding benchmarks were rerun with the requirements on github; no significant changes in the results; corrected figures in the Appendix

arXiv:1909.02212 [pdf, other]

Author Growth Outstrips Publication Growth in Computer Science and Publication Quality Correlates with Collaboration

Authors: Stephen M. Blackburn, Kathryn S. McKinley, Lexing Xie

Abstract: Although the computer science community successfully harnessed exponential increases in computer performance to drive societal and economic change, the exponential growth in publications is proving harder to accommodate. To gain a deeper understanding of publication growth and inform how the computer science community should handle this growth, we analyzed publication practices from several perspectives: ACM sponsored publications in the ACM Digital Library as a whole; subdisciplines captured by ACM's Special Interest Groups (SIGs); ten top conferences; institutions; four top U.S. departments; authors; faculty; and PhDs between 1990 and 2012. ACM publishes a large fraction of all computer science research. We first summarize how we believe our main findings inform (1) expectations on publication growth, (2) how to distinguish research quality from output quantity, and (3) the evaluation of individual researchers. We then further motivate the study of computer science publication practices and describe our methodology and results in detail.

Submitted 5 September, 2019; originally announced September 2019.

arXiv:1907.12748 [pdf, other]

Influence Flowers of Academic Entities

Authors: Minjeong Shin, Alexander Soen, Benjamin T. Readshaw, Stephen M. Blackburn, Mitchell Whitelaw, Lexing Xie

Abstract: We present the Influence Flower, a new visual metaphor for the influence profile of academic entities, including people, projects, institutions, conferences, and journals. While many tools quantify influence, we aim to expose the flow of influence between entities. The Influence Flower is an ego-centric graph, with a query entity placed in the centre. The petals are styled to reflect the strength of influence to and from other entities of the same or different type. For example, one can break down the incoming and outgoing influences of a research lab by research topics. The Influence Flower uses a recent snapshot of Microsoft Academic Graph, consisting of 212 million authors, their 176 million publications, and 1.2 billion citations. An interactive web app, Influence Map, is constructed around this central metaphor for searching and curating visualisations. We also propose a visual comparison method that highlights change in influence patterns over time. We demonstrate through several case studies that the Influence Flower supports data-driven inquiries about the following: researchers' careers over time; paper(s) and projects, including those with delayed recognition; the interdisciplinary profile of a research institution; and the shifting topical trends in conferences. We also use this tool on influence data beyond academic citations, by contrasting the academic and Twitter activities of a researcher.

Submitted 30 July, 2019; originally announced July 2019.

Comments: VAST 2019

arXiv:1810.07970 [pdf, other]

Inglenook Shunting Puzzles

Authors: Simon R. Blackburn

Abstract: An inglenook puzzle is a classic shunting (switching) puzzle often found on model railway layouts. A collection of wagons sits in a fan of sidings with a limited length headshunt (lead track). The aim of the puzzle is to rearrange the wagons into a desired order (often a randomly chosen order). This article answers the question: When can you be sure this can always be done? The problem of finding a solution in a minimum number of moves is also addressed.

Submitted 3 April, 2019; v1 submitted 18 October, 2018; originally announced October 2018.

Comments: 23 pages, 4 figures. Minor typos in previous version corrected

MSC Class: 68P10

arXiv:1807.06036 [pdf, other]

Pangloss: Fast Entity Linking in Noisy Text Environments

Authors: Michael Conover, Matthew Hayes, Scott Blackburn, Pete Skomoroch, Sam Shah

Abstract: Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.

Submitted 16 July, 2018; originally announced July 2018.

Comments: KDD 2018

arXiv:1807.00071 [pdf]

GOTO Rankings Considered Helpful

Authors: Emery Berger, Stephen M. Blackburn, Carla Brodley, H. V. Jagadish, Kathryn S. McKinley, Mario A. Nascimento, Minjeong Shin, Lexing Xie

Abstract: Rankings are a fact of life. Whether or not one likes them, they exist and are influential. Within academia, and in computer science in particular, rankings not only capture our attention but also widely influence people who have a limited understanding of computing science research, including prospective students, university administrators, and policy-makers. In short, rankings matter. This position paper advocates for the adoption of "GOTO rankings": rankings that use Good data, are Open, Transparent, and Objective, and the rejection of rankings that do not meet these criteria.

Submitted 24 April, 2019; v1 submitted 29 June, 2018; originally announced July 2018.

Comments: Accepted, to appear in Communications of the ACM

arXiv:1609.07070 [pdf, ps, other]

PIR Array Codes with Optimal PIR Rates

Authors: Simon R. Blackburn, Tuvi Etzion

Abstract: There has been much recent interest in Private Information Retrieval (PIR) in models where a database is stored across several servers using coding techniques from distributed storage, rather than being simply replicated. In particular, a recent breakthrough result of Fazeli, Vardy and Yaakobi introduces the notion of a PIR code and a PIR array code, and uses this notion to produce efficient protocols. In this paper we are interested in designing PIR array codes. We consider the case when we have $m$ servers, with each server storing a fraction $(1/s)$ of the bits of the database; here $s$ is a fixed rational number with $s > 1$. We study the maximum PIR rate of a PIR array code with the $k$-PIR property (which enables a $k$-server PIR protocol to be emulated on the $m$ servers), where the PIR rate is defined to be $k/m$. We present upper bounds on the achievable rate, some constructions, and ideas on how to obtain PIR array codes with the highest possible PIR rate. In particular, we present constructions that asymptotically meet our upper bounds, and the exact largest PIR rate is obtained when $1 < s \leq 2$.

Submitted 17 December, 2016; v1 submitted 22 September, 2016; originally announced September 2016.

Comments: A conference version for arXiv:1607.00235

arXiv:1609.07027 [pdf, ps, other]

PIR schemes with small download complexity and low storage requirements

Authors: Simon R. Blackburn, Tuvi Etzion, Maura B. Paterson

Abstract: In the classical model for (information theoretically secure) Private Information Retrieval (PIR), a user wishes to retrieve one bit of a database that is stored on a set of $n$ servers, in such a way that no individual server gains information about which bit the user is interested in. The aim is to design schemes that minimise communication between the user and the servers. More recently, there have been moves to consider more realistic models where the total storage of the set of servers, or the per server storage, should be minimised (possibly using techniques from distributed storage), and where the database is divided into $R$-bit records with $R>1$, and the user wishes to retrieve one record rather than one bit. When $R$ is large, downloads from the servers to the user dominate the communication complexity and so the aim is to minimise the total number of downloaded bits. Shah, Rashmi and Ramchandran show that at least $R+1$ bits must be downloaded from servers in the worst case, and provide PIR schemes meeting this bound. Sun and Jafar determine the best asymptotic download cost of a PIR scheme (as $R\rightarrow\infty$), where this cost is defined as the ratio of the message length $R$ and the total number of bits downloaded. This paper provides various bounds on the download complexity of a PIR scheme, generalising those of Shah et al. to the case when the number $n$ of servers is bounded, and providing links with classical techniques due to Chor et al. The paper also provides a range of constructions for PIR schemes that are either simpler or perform better than previously known schemes, including explicit schemes that achieve the best asymptotic download complexity of Sun and Jafar with significantly lower upload complexity, and general techniques for constructing a scheme with good worst case download complexity from a scheme with good download complexity on average.

Submitted 4 December, 2018; v1 submitted 22 September, 2016; originally announced September 2016.

Comments: 30 pages. Minor updates and corrections throughout, with updated bibliography

MSC Class: 94A60
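
For readers new to the classical model, the textbook two-server scheme for retrieving a single bit from a replicated database gives a concrete picture. It is shown here only as background and is not one of the constructions from this paper. The sketch assumes two non-colluding servers that each hold the whole database; the function names are hypothetical.

```python
import secrets


def make_queries(n_bits, index):
    """Pick a uniformly random subset S of positions; send S to one server
    and S with `index` toggled to the other."""
    s = {j for j in range(n_bits) if secrets.randbits(1)}
    return s, s ^ {index}


def server_answer(database, query):
    """Each server returns the XOR of the requested bits: one bit of download."""
    answer = 0
    for j in query:
        answer ^= database[j]
    return answer


def retrieve(database, index):
    """The two answers differ exactly by the wanted bit, so XOR recovers it."""
    q1, q2 = make_queries(len(database), index)
    return server_answer(database, q1) ^ server_answer(database, q2)


database = [1, 0, 1, 1, 0, 0, 1, 0]
print([retrieve(database, i) for i in range(len(database))])
```

Each server on its own sees a uniformly random subset of positions, so it learns nothing about `index`. The download is two bits in total, while the upload is one subset per server; trade-offs of this kind are what the abstract's download and upload complexity refer to.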

arXiv:1607.00235 [pdf, ps, other]

PIR Array Codes with Optimal Virtual Server Rate

Authors: Simon Blackburn, Tuvi Etzion

Abstract: There has been much recent interest in Private Information Retrieval (PIR) in models where a database is stored across several servers using coding techniques from distributed storage, rather than being simply replicated. In particular, a recent breakthrough result of Fazeli, Vardy and Yaakobi introduces the notion of a PIR code and a PIR array code, and uses this notion to produce efficient PIR protocols. In this paper we are interested in designing PIR array codes. We consider the case when we have $m$ servers, with each server storing a fraction $(1/s)$ of the bits of the database; here $s$ is a fixed rational number with $s > 1$. A PIR array code with the $k$-PIR property enables a $k$-server PIR protocol (with $k\leq m$) to be emulated on $m$ servers, with the overall storage requirements of the protocol being reduced. The communication complexity of a PIR protocol reduces as $k$ grows, so the virtual server rate, defined to be $k/m$, is an important parameter. We study the maximum virtual server rate of a PIR array code with the $k$-PIR property. We present upper bounds on the achievable virtual server rate, some constructions, and ideas on how to obtain PIR array codes with the highest possible virtual server rate. In particular, we present constructions that asymptotically meet our upper bounds, and the exact largest virtual server rate is obtained when $1 < s \leq 2$. A $k$-PIR code (and similarly a $k$-PIR array code) is also a locally repairable code with symbol availability $k-1$. Such a code ensures $k$ parallel reads for each information symbol. So the virtual server rate is very closely related to the symbol availability of the code when used as a locally repairable code. The results of this paper are discussed also in this context, where subspace codes also have an important role.

Submitted 6 February, 2018; v1 submitted 1 July, 2016; originally announced July 2016.

arXiv:1602.00860 [pdf, ps, other]

On the security of the Algebraic Eraser tag authentication protocol

Authors: Simon R. Blackburn, M. J. B. Robshaw

Abstract: The Algebraic Eraser has been gaining prominence as SecureRF, the company commercializing the algorithm, increases its marketing reach. The scheme is claimed to be well-suited to IoT applications but a lack of detail in available documentation has hampered peer-review. Recently more details of the system have emerged after a tag authentication protocol built using the Algebraic Eraser was proposed for standardization in ISO/IEC SC31 and SecureRF provided an open public description of the protocol. In this paper we describe a range of attacks on this protocol that include very efficient and practical tag impersonation as well as partial, and total, tag secret key recovery. Most of these results have been practically verified; they contrast with the 80-bit security that is claimed for the protocol, and they emphasize the importance of independent public review for any cryptographic proposal.

Submitted 2 June, 2016; v1 submitted 2 February, 2016; originally announced February 2016.

Comments: 21 pages. Minor changes. Final version accepted for ACNS 2016

MSC Class: 94A60

arXiv:1601.06037 [pdf, other]

Finite field matrix channels for network coding

Authors: Simon R. Blackburn, Jessica Claridge

Abstract: In 2010, Silva, Kschischang and Kötter studied certain classes of finite field matrix channels in order to model random linear network coding where exactly $t$ random errors are introduced. In this paper we consider a generalisation of these matrix channels where the number of errors is not required to be constant; indeed, the number of errors may follow any distribution. We show that a capacity-achieving input distribution can always be taken to have a very restricted form (the distribution should be uniform given the rank of the input matrix). This result complements, and is inspired by, a paper of Nobrega, Silva and Uchoa-Filho that establishes a similar result for a class of matrix channels that model network coding with link erasures. Our result shows that the capacity of our channels can be expressed as a maximisation over probability distributions on the set of possible ranks of input matrices: a set of linear rather than exponential size.

Submitted 31 January, 2018; v1 submitted 22 January, 2016; originally announced January 2016.

Comments: 21 pages. A significant revision: the main counting arguments shortened; computational results added; other minor revisions throughout

MSC Class: 94A40

arXiv:1511.03870 [pdf, ps, other]

A Practical Cryptanalysis of the Algebraic Eraser

Authors: Adi Ben-Zvi, Simon R. Blackburn, Boaz Tsaban

Abstract: Anshel, Anshel, Goldfeld and Lemieux introduced the Colored Burau Key Agreement Protocol (CBKAP) as the concrete instantiation of their Algebraic Eraser scheme. This scheme, based on techniques from permutation groups, matrix groups and braid groups, is designed for lightweight environments such as RFID tags and other IoT applications. It is proposed as an underlying technology for ISO/IEC 29167-20. SecureRF, the company owning the trademark Algebraic Eraser, has presented the scheme to the IRTF with a view towards standardisation. We present a novel cryptanalysis of this scheme. For parameter sizes corresponding to claimed 128-bit security, our implementation recovers the shared key using less than 8 CPU hours, and less than 64MB of memory.

Submitted 2 June, 2016; v1 submitted 12 November, 2015; originally announced November 2015.

Comments: 15 pages. Updated references, with brief comments added. Minor typos corrected. Final version, accepted for CRYPTO 2016

MSC Class: 20F36; 94A60; 20B40

arXiv:1509.02748 [pdf, other]

Maximum likelihood decoding for multilevel channels with gain and offset mismatch

Authors: Simon R. Blackburn

Abstract: K.A.S. Immink and J.H. Weber recently defined and studied a channel with both gain and offset mismatch, modelling the behaviour of charge-leakage in flash memory. They proposed a decoding measure for this channel based on minimising Pearson distance (a notion from cluster analysis). The paper derives a formula for maximum likelihood decoding for this channel, and also defines and justifies a notion of minimum distance of a code in this context.

Submitted 9 September, 2015; originally announced September 2015.

Comments: 17 pages, 7 figures

arXiv:1509.00291 [pdf, other]

doi 10.1109/TIT.2015.2490219

Pearson codes

Authors: Jos H. Weber, Kees A. Schouhamer Immink, Simon R. Blackburn

Abstract: The Pearson distance has been advocated for improving the error performance of noisy channels with unknown gain and offset. The Pearson distance can only fruitfully be used for sets of $q$-ary codewords, called Pearson codes, that satisfy specific properties. We will analyze constructions and properties of optimal Pearson codes. We will compare the redundancy of optimal Pearson codes with the redundancy of prior art $T$-constrained codes, which consist of $q$-ary sequences in which $T$ pre-determined reference symbols appear at least once. In particular, it will be shown that for $q\le 3$ the $2$-constrained codes are optimal Pearson codes, while for $q\ge 4$ these codes are not optimal.

Submitted 29 September, 2015; v1 submitted 1 September, 2015; originally announced September 2015.

Comments: 17 pages. Minor revisions and corrections since previous version. Author biographies added. To appear in IEEE Trans. Inform. Theory

arXiv:1505.02597 [pdf, ps, other]

doi 10.1109/TIT.2015.2473848

Probabilistic existence results for separable codes

Authors: Simon R. Blackburn

Abstract: Separable codes were defined by Cheng and Miao in 2011, motivated by applications to the identification of pirates in a multimedia setting. Combinatorially, $\overline{t}$-separable codes lie somewhere between $t$-frameproof and $(t-1)$-frameproof codes: all $t$-frameproof codes are $\overline{t}$-separable, and all $\overline{t}$-separable codes are $(t-1)$-frameproof. Results for frameproof codes show that (when $q$ is large) there are $q$-ary $\overline{t}$-separable codes of length $n$ with approximately $q^{\lceil n/t\rceil}$ codewords, and that no $q$-ary $\overline{t}$-separable codes of length $n$ can have more than approximately $q^{\lceil n/(t-1)\rceil}$ codewords. The paper provides improved probabilistic existence results for $\overline{t}$-separable codes when $t\geq 3$. More precisely, for all $t\geq 3$ and all $n\geq 3$, there exists a constant $\kappa$ (depending only on $t$ and $n$) such that there exists a $q$-ary $\overline{t}$-separable code of length $n$ with at least $\kappa q^{n/(t-1)}$ codewords for all sufficiently large integers $q$. This shows, in particular, that the upper bound (derived from the bound on $(t-1)$-frameproof codes) on the number of codewords in a $\overline{t}$-separable code is realistic. The results above are more surprising after examining the situation when $t=2$. Results due to Gao and Ge show that a $q$-ary $\overline{2}$-separable code of length $n$ can contain at most $\frac{3}{2}q^{2\lceil n/3\rceil}-\frac{1}{2}q^{\lceil n/3\rceil}$ codewords, and that codes with at least $\kappa q^{2n/3}$ codewords exist. So optimal $\overline{2}$-separable codes behave neither like $2$-frameproof nor $1$-frameproof codes. Also, the Gao--Ge bound is strengthened to show that a $q$-ary $\overline{2}$-separable code of length $n$ can have at most \[ q^{\lceil 2n/3\rceil}+\tfrac{1}{2}q^{\lfloor n/3\rfloor}(q^{\lfloor n/3\rfloor}-1) \] codewords.

Submitted 25 August, 2015; v1 submitted 11 May, 2015; originally announced May 2015.

Comments: 16 pages. Typos corrected and minor changes since last version. Accepted by IEEE Transactions on Information Theory

arXiv:1303.1026 [pdf, ps, other]

doi 10.1109/TIT.2015.2456634

Non-overlapping codes

Authors: Simon R. Blackburn

Abstract: We say that a $q$-ary length $n$ code is \emph{non-overlapping} if the set of non-trivial prefixes of codewords and the set of non-trivial suffixes of codewords are disjoint. These codes were first studied by Levenshtein in 1964, motivated by applications in synchronisation. More recently these codes were independently invented (under the name \emph{cross-bifix-free} codes) by Bajić and Stojanović. We provide a simple construction for a class of non-overlapping codes which has optimal cardinality whenever $n$ divides $q$. Moreover, for all parameters $n$ and $q$ we show that a code from this class is close to optimal, in the sense that it has cardinality within a constant factor of an upper bound due to Levenshtein from 1970. Previous constructions have cardinality within a constant factor of the upper bound only when $q$ is fixed. Chee, Kiah, Purkayastha and Wang showed that a $q$-ary length $n$ non-overlapping code contains at most $q^n/(2n-1)$ codewords; this bound is weaker than the Levenshtein bound. Their proof appealed to the application in synchronisation: we provide a direct combinatorial argument to establish the bound of Chee \emph{et al}. We also consider codes of short length, finding the leading term of the maximal cardinality of a non-overlapping code when $n$ is fixed and $q\rightarrow \infty$. The largest cardinality of non-overlapping codes of lengths $3$ or less is determined exactly.

Submitted 8 July, 2015; v1 submitted 5 March, 2013; originally announced March 2013.

Comments: 14 pages. Extra explanations added at some points, and an extra citation. To appear in IEEE Trans Information Theory
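
The non-overlapping property defined in the abstract above can be checked directly for small codes. The following is a minimal sketch, not taken from the paper: the function name `is_non_overlapping` is hypothetical, codewords are assumed to be equal-length strings, and "non-trivial" is read as "of length between 1 and $n-1$".

```python
def is_non_overlapping(code):
    """Return True if no non-trivial prefix of any codeword equals
    a non-trivial suffix of any codeword (possibly the same one).

    Assumption: `code` is a non-empty list of strings of equal length n,
    and non-trivial means length 1..n-1.
    """
    n = len(code[0])
    prefixes = {w[:i] for w in code for i in range(1, n)}
    suffixes = {w[-i:] for w in code for i in range(1, n)}
    return prefixes.isdisjoint(suffixes)


# A ternary length-3 code whose prefixes {0, 00} avoid its suffixes {1, 01, 2, 02}.
print(is_non_overlapping(["001", "002"]))
# Here the suffix "01" of "001" is a prefix of "011".
print(is_non_overlapping(["001", "011"]))
```

The check compares whole sets of prefixes and suffixes, so it also rejects a single codeword that overlaps itself, such as `"010"`.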

arXiv:1111.2713 [pdf, ps, other]

The asymptotic behavior of Grassmannian codes

Authors: Simon R. Blackburn, Tuvi Etzion

Abstract: The iterated Johnson bound is the best known upper bound on the size of an error-correcting code in the Grassmannian $\mathcal{G}_q(n,k)$. The iterated Schönheim bound is the best known lower bound on the size of a covering code in $\mathcal{G}_q(n,k)$. We use probabilistic methods to prove that both bounds are asymptotically attained for fixed $k$ and fixed radius, as $n$ approaches infinity. We also determine the asymptotics of the size of the best Grassmannian codes and covering codes when $n-k$ and the radius are fixed, as $n$ approaches infinity.

Submitted 11 November, 2011; originally announced November 2011.

Comments: 5 pages

MSC Class: 94B60

arXiv:1102.2358 [pdf, ps, other]

Cryptanalysis of three matrix-based key establishment protocols

Authors: Simon R. Blackburn, Carlos Cid, Ciaran Mullan

Abstract: We cryptanalyse a matrix-based key transport protocol due to Baumslag, Camps, Fine, Rosenberger and Xu from 2006. We also cryptanalyse two recently proposed matrix-based key agreement protocols, due to Habeeb, Kahrobaei and Shpilrain, and due to Romanczuk and Ustimenko.

Submitted 11 February, 2011; originally announced February 2011.

Comments: 9 pages

arXiv:1102.1053 [pdf, ps, other]

On the Distribution of the Subset Sum Pseudorandom Number Generator on Elliptic Curves

Authors: Simon R. Blackburn, Alina Ostafe, Igor E. Shparlinski

Abstract: Given a prime $p$, an elliptic curve $E/\mathbb{F}_p$ over the finite field $\mathbb{F}_p$ of $p$ elements and a binary linear recurrence sequence $(u(n))_{n=1}^\infty$ of order $r$, we study the distribution of the sequence of points $$ \sum_{j=0}^{r-1} u(n+j)P_j, \qquad n =1,\ldots, N, $$ on average over all possible choices of $\mathbb{F}_p$-rational points $P_0,\ldots, P_{r-1}$ on $E$. For a sufficiently large $N$ we improve and generalise a previous result in this direction due to E. El Mahassni.

Submitted 5 February, 2011; originally announced February 2011.

MSC Class: Primary 11K45; 11T71; Secondary 11G05; 11T23; 65C05; 94A60

arXiv:1101.1172 [pdf, ps, other]

The existence of k-radius sequences

Authors: Simon R. Blackburn

Abstract: Let $n$ and $k$ be positive integers, and let $F$ be an alphabet of size $n$. A sequence over $F$ of length $m$ is a \emph{$k$-radius sequence} if any two distinct elements of $F$ occur within distance $k$ of each other somewhere in the sequence. These sequences were introduced by Jaromczyk and Lonc in 2004, in order to produce an efficient caching strategy when computing certain functions on large data sets such as medical images. Let $f_k(n)$ be the length of the shortest $n$-ary $k$-radius sequence. The paper shows, using a probabilistic argument, that whenever $k$ is fixed and $n\rightarrow\infty$ \[ f_k(n)\sim \frac{1}{k}\binom{n}{2}. \] The paper observes that the same argument generalises to the situation when we require the following stronger property for some integer $t$ such that $2\leq t\leq k+1$: any $t$ distinct elements of $F$ must simultaneously occur within a distance $k$ of each other somewhere in the sequence.

Submitted 5 August, 2011; v1 submitted 6 January, 2011; originally announced January 2011.

Comments: 8 pages. More papers cited, and a minor reorganisation of the last section, since last version. Typo corrected in the statement of Theorem 4

MSC Class: 94A55
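
The definition of a $k$-radius sequence in the abstract above translates into a short check. This is an illustrative sketch only: the name `is_k_radius` is hypothetical, and "within distance $k$" is assumed to mean that the two symbols appear at positions differing by at most $k$.

```python
from itertools import combinations


def is_k_radius(seq, alphabet, k):
    """Return True if every pair of distinct symbols of `alphabet`
    appears at two positions of `seq` that differ by at most k.

    Assumption: distance is the difference of positions in the sequence.
    """
    covered = set()
    for i, a in enumerate(seq):
        # Look ahead at most k positions from position i.
        for b in seq[i + 1:i + k + 1]:
            if a != b:
                covered.add(frozenset((a, b)))
    return all(frozenset(pair) in covered for pair in combinations(alphabet, 2))


# Over the alphabet {0, 1, 2}, the sequence 0 1 2 0 covers all three pairs at distance 1.
print(is_k_radius([0, 1, 2, 0], [0, 1, 2], 1))
# The shorter sequence 0 1 2 never brings 0 and 2 within distance 1.
print(is_k_radius([0, 1, 2], [0, 1, 2], 1))
```

Each position contributes at most $k$ new pairs, which gives some intuition for why the shortest length grows like $\frac{1}{k}\binom{n}{2}$.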

arXiv:0910.4325 [pdf, ps, other]

Putting Dots in Triangles

Authors: Simon R. Blackburn, Maura B. Paterson, Douglas R. Stinson

Abstract: Given a right-angled triangle of squares in a grid whose horizontal and vertical sides are $n$ squares long, let $N(n)$ denote the maximum number of dots that can be placed into the cells of the triangle such that each row, each column, and each diagonal parallel to the long side of the triangle contains at most one dot. It has been proven that $N(n) = \lfloor \frac{2n+1}{3} \rfloor$. In this note, we give a new proof of this result using linear programming techniques.

Submitted 18 May, 2010; v1 submitted 22 October, 2009; originally announced October 2009.

Comments: 10 pages. Minor rephrasing: final version to submit to journal.
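
The formula $N(n) = \lfloor \frac{2n+1}{3} \rfloor$ quoted in the abstract above can be sanity-checked by exhaustive search for small $n$. The sketch below is not the linear programming proof from the note. It assumes a coordinate model that is my own reading of the geometry: cells are pairs (row, col) with row + col $\leq n-1$, and diagonals parallel to the long side are the sets of cells with constant row + col. The name `max_dots` is hypothetical.

```python
def max_dots(n):
    """Brute-force the maximum number of dots in a triangle with legs of n cells.

    Assumed model: cells (row, col) with row + col <= n - 1; at most one dot
    per row, per column, and per diagonal (constant row + col).
    """
    best = 0

    def place(row, cols, diags, count):
        nonlocal best
        if row == n:
            best = max(best, count)
            return
        # Option 1: leave this row empty.
        place(row + 1, cols, diags, count)
        # Option 2: put one dot in this row, which has cells col = 0..n-1-row.
        for col in range(n - row):
            d = row + col
            if col not in cols and d not in diags:
                place(row + 1, cols | {col}, diags | {d}, count + 1)

    place(0, frozenset(), frozenset(), 0)
    return best


# Compare the search with the closed formula for a few small sizes.
for n in range(1, 7):
    print(n, max_dots(n), (2 * n + 1) // 3)
```

The search is exponential in $n$, so it is only practical for small triangles.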

Showing 1–27 of 27 results for author: Blackburn, S