-
Enhancing DNA Foundation Models to Address Masking Inefficiencies
Authors:
Monireh Safari,
Pablo Millan Arias,
Scott C. Lowe,
Lila Kari,
Angel X. Chang,
Graham W. Taylor
Abstract:
Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work only relevant to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.
Submitted 25 February, 2025;
originally announced February 2025.
-
CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences
Authors:
Fatemeh Alipour,
Kathleen A. Hill,
Lila Kari
Abstract:
This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
Submitted 13 November, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
Authors:
Zahra Gharaee,
Scott C. Lowe,
ZeMing Gong,
Pablo Millan Arias,
Nicholas Pellegrino,
Austin T. Wang,
Joakim Bruslund Haurum,
Iuliia Zarubiieva,
Lila Kari,
Dirk Steinke,
Graham W. Taylor,
Paul Fieguth,
Angel X. Chang
Abstract:
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical and size information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/bioscan-ml/BIOSCAN-5M.
Submitted 28 February, 2025; v1 submitted 18 June, 2024;
originally announced June 2024.
-
BarcodeBERT: Transformers for Biodiversity Analysis
Authors:
Pablo Millan Arias,
Niousha Sadjadi,
Monireh Safari,
ZeMing Gong,
Austin T. Wang,
Joakim Bruslund Haurum,
Iuliia Zarubiieva,
Dirk Steinke,
Lila Kari,
Angel X. Chang,
Scott C. Lowe,
Graham W. Taylor
Abstract:
In the global challenge of understanding and characterizing biodiversity, short species-specific genomic sequences known as DNA barcodes play a critical role, enabling fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on generic supervised training algorithms. We introduce BarcodeBERT, a family of models tailored to biodiversity analysis and trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes. We compared the performance of BarcodeBERT on taxonomic identification tasks against a spectrum of machine learning approaches including supervised training of classical neural architectures and fine-tuning of general DNA foundation models. Our self-supervised pretraining strategies on domain-specific data outperform fine-tuned foundation models, especially in identification tasks involving lower taxa such as genera and species. We also compared BarcodeBERT with BLAST, one of the most widely used bioinformatics tools for sequence searching, and found that our method matched BLAST's performance in species-level classification while being 55 times faster. Our analysis of masking and tokenization strategies also provides practical guidance for building customized DNA language models, emphasizing the importance of aligning model training strategies with dataset characteristics and domain knowledge. The code repository is available at https://github.com/bioscan-ml/BarcodeBERT.
Submitted 21 January, 2025; v1 submitted 4 November, 2023;
originally announced November 2023.
-
Descriptional Complexity of Semi-Simple Splicing Systems
Authors:
Lila Kari,
Timothy Ng
Abstract:
Splicing systems are generative mechanisms introduced by Tom Head in 1987 to model the biological process of DNA recombination. The computational engine of a splicing system is the "splicing operation", a cut-and-paste binary string operation defined by a set of "splicing rules" $r = (α_1, α_2 ; α_3, α_4)$ where $α_1, α_2, α_3, α_4$ are words over an alphabet $Σ$. For two strings $x = x_1 α_1 α_2 x_2$ and $y = y_1 α_3 α_4 y_2$, applying the splicing rule $r$ produces the string $z = x_1 α_1 α_4 y_2$.
In this paper we focus on a particular type of splicing systems, called $(i, j)$ semi-simple splicing systems, $i = 1,2$ and $j = 3, 4$, wherein all splicing rules have the property that the two strings in positions $i$ and $j$ are singleton letters, while the other two strings are empty. The language generated by such a system consists of the set of words that are obtained starting from an initial set called "axiom set", by iteratively applying the splicing rules to strings in the axiom set as well as to intermediately produced strings. We consider semi-simple splicing systems where the axiom set is a regular language, and investigate the descriptional complexity of such systems in terms of the size of the minimal deterministic finite automata that recognize the languages they generate.
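As a rough illustration of the splicing operation defined in this abstract, the following Python sketch enumerates every string $z = x_1 α_1 α_4 y_2$ obtainable from one pair of strings and one rule. The function name `splice` and the tuple encoding of a rule are illustrative assumptions, not notation from the paper, and the sketch performs a single splicing step only (no iteration over an axiom set).

```python
from typing import Set, Tuple

# A splicing rule r = (a1, a2 ; a3, a4); each component is a (possibly empty) word.
# This tuple encoding is an assumption made for illustration.
Rule = Tuple[str, str, str, str]


def splice(x: str, y: str, rule: Rule) -> Set[str]:
    """Return every z = x1 a1 a4 y2 with x = x1 a1 a2 x2 and y = y1 a3 a4 y2."""
    a1, a2, a3, a4 = rule
    left, right = a1 + a2, a3 + a4
    results: Set[str] = set()
    # Try every position where a1 a2 occurs in x.
    for i in range(len(x) - len(left) + 1):
        if not x.startswith(left, i):
            continue
        head = x[: i + len(a1)]  # x1 a1
        # Try every position where a3 a4 occurs in y.
        for j in range(len(y) - len(right) + 1):
            if y.startswith(right, j):
                tail = y[j + len(a3):]  # a4 y2
                results.add(head + tail)
    return results


if __name__ == "__main__":
    # General rule (a, b ; c, d).
    print(splice("xxabyy", "uucdvv", ("a", "b", "c", "d")))  # {'xxadvv'}
    # A (1, 3) semi-simple rule: positions 1 and 3 hold single letters,
    # positions 2 and 4 are empty.
    print(splice("cab", "dbe", ("a", "", "b", "")))  # {'cae'}
```

In the (1, 3) semi-simple case the rule cuts $x$ just after an occurrence of the first letter and $y$ just after an occurrence of the second, which is why only two letters are needed to describe it.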
Submitted 5 September, 2019;
originally announced September 2019.
-
State Complexity of Overlap Assembly
Authors:
Janusz Brzozowski,
Lila Kari,
Bai Li,
Marek Szykuła
Abstract:
The \emph{state complexity} of a regular language $L_m$ is the number $m$ of states in a minimal deterministic finite automaton (DFA) accepting $L_m$. The state complexity of a regularity-preserving binary operation on regular languages is defined as the maximal state complexity of the result of the operation where the two operands range over all languages of state complexities $\le m$ and $\le n$, respectively. We find a tight upper bound on the state complexity of the binary operation \emph{overlap assembly} on regular languages. This operation was introduced by Csuhaj-Varjú, Petre, and Vaszil to model the process of self-assembly of two linear DNA strands into a longer DNA strand, provided that their ends "overlap". We prove that the state complexity of the overlap assembly of languages $L_m$ and $L_n$, where $m\ge 2$ and $n\ge1$, is at most $2 (m-1) 3^{n-1} + 2^n$. Moreover, for $m \ge 2$ and $n \ge 3$ there exist languages $L_m$ and $L_n$ over an alphabet of size $n$ whose overlap assembly meets the upper bound and this bound cannot be met with smaller alphabets. Finally, we prove that $m+n$ is a tight upper bound on the overlap assembly of unary languages, and that there are binary languages whose overlap assembly has exponential state complexity at least $m(2^{n-1}-2)+2$.
Submitted 11 December, 2018; v1 submitted 16 October, 2017;
originally announced October 2017.
-
Transducer Descriptions of DNA Code Properties and Undecidability of Antimorphic Problems
Authors:
Lila Kari,
Stavros Konstantinidis,
Steffen Kopecki
Abstract:
This work concerns formal descriptions of DNA code properties, and builds on previous work on transducer descriptions of classic code properties and on trajectory descriptions of DNA code properties. This line of research allows us to give a property as input to an algorithm, in addition to any regular language, which can then answer questions about the language and the property. Here we define DNA code properties via transducers and show that this method is strictly more expressive than that of trajectories, without sacrificing the efficiency of deciding the satisfaction question. We also show that the maximality question can be undecidable. Our undecidability results hold not only for the fixed DNA involution but also for any fixed antimorphic permutation. Moreover, we also show the undecidability of the antimorphic version of the Post Correspondence Problem, for any fixed antimorphic permutation.
Submitted 27 February, 2015;
originally announced March 2015.
-
An efficient algorithm for computing the edit distance of a regular language via input-altering transducers
Authors:
Lila Kari,
Stavros Konstantinidis,
Steffen Kopecki,
Meng Yang
Abstract:
We revisit the problem of computing the edit distance of a regular language given via an NFA. This problem relates to the inherent maximal error-detecting capability of the language in question. We present an efficient algorithm for solving this problem which executes in time $O(r^2n^2d)$, where $r$ is the cardinality of the alphabet involved, $n$ is the number of transitions in the given NFA, and $d$ is the computed edit distance. We have implemented the algorithm and present here performance tests. The correctness of the algorithm is based on the result (also presented here) that the particular error-detection property related to our problem can be defined via an input-altering transducer.
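To make the quantity being computed concrete: the edit distance of a language is usually taken to be the smallest Levenshtein distance between two distinct words of the language. The sketch below checks that definition by brute force on a finite set of words. It is not the paper's NFA-based algorithm and does not achieve the $O(r^2n^2d)$ bound; the function names are illustrative assumptions.

```python
from itertools import combinations
from typing import Iterable


def levenshtein(u: str, v: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def language_edit_distance(words: Iterable[str]) -> int:
    """Minimum edit distance over all pairs of distinct words (finite language only)."""
    distinct = sorted(set(words))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct words")
    return min(levenshtein(u, v) for u, v in combinations(distinct, 2))


if __name__ == "__main__":
    # Every pair of these code words differs by at least two edits.
    print(language_edit_distance({"0000", "0011", "1100", "1111"}))  # 2
```

A language with edit distance $d$ can detect up to $d - 1$ edit errors, which is the error-detecting capability the abstract refers to.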
△ Less
Submitted 4 June, 2014;
originally announced June 2014.
-
Binary pattern tile set synthesis is NP-hard
Authors:
Lila Kari,
Steffen Kopecki,
Pierre-Étienne Meunier,
Matthew J. Patitz,
Shinnosuke Seki
Abstract:
In the field of algorithmic self-assembly, a long-standing unproven conjecture has been that of the NP-hardness of binary pattern tile set synthesis (2-PATS). The $k$-PATS problem is that of designing a tile assembly system with the smallest number of tile types which will self-assemble an input pattern of $k$ colors. Of both theoretical and practical significance, $k$-PATS has been studied in a series of papers which have shown $k$-PATS to be NP-hard for $k = 60$, $k = 29$, and then $k = 11$. In this paper, we close the fundamental conjecture that 2-PATS is NP-hard, concluding this line of study.
While most of our proof relies on standard mathematical proof techniques, one crucial lemma makes use of a computer-assisted proof, which is a relatively novel but increasingly utilized paradigm for deriving proofs for complex mathematical problems. This tool is especially powerful for attacking combinatorial problems, as exemplified by the proof of the four color theorem by Appel and Haken (simplified later by Robertson, Sanders, Seymour, and Thomas) or the recent important advance on the Erdős discrepancy problem by Konev and Lisitsa using computer programs. We utilize a massively parallel algorithm and thus turn an otherwise intractable portion of our proof into a program which requires approximately a year of computation time, bringing the use of computer-assisted proofs to a new scale. We fully detail the algorithm employed by our code, and make the code freely available online.
Submitted 3 April, 2014;
originally announced April 2014.
-
Map of Life: Measuring and Visualizing Species' Relatedness with "Molecular Distance Maps"
Authors:
Lila Kari,
Kathleen A. Hill,
Abu Sadat Sayem,
Nathaniel Bryans,
Katelyn Davis,
Nikesh S. Dattani
Abstract:
We propose a novel combination of methods that (i) portrays quantitative characteristics of a DNA sequence as an image, (ii) computes distances between these images, and (iii) uses these distances to output a map wherein each sequence is a point in a common Euclidean space. In the resulting "Molecular Distance Map" each point signifies a DNA sequence, and the geometric distance between any two points reflects the degree of relatedness between the corresponding sequences and species.
Molecular Distance Maps present compelling visual representations of relationships between species and could be used for taxonomic clarifications, for species identification, and for studies of evolutionary history. One of the advantages of this method is its general applicability since, as sequence alignment is not required, the DNA sequences chosen for comparison can be completely different regions in different genomes. In fact, this method can be used to compare any two DNA sequences. For example, in our dataset of 3,176 mitochondrial DNA sequences, it correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and it finds that the sequence most different from it belongs to a cucumber. Furthermore, our method can be used to compare real sequences to artificial, computer-generated, DNA sequences. For example, it is used to determine that the distances between a Homo sapiens sapiens mtDNA and artificial sequences of the same length and same trinucleotide frequencies can be larger than the distance between the same human mtDNA and the mtDNA of a fruit-fly.
We demonstrate this method's promising potential for taxonomical clarifications by applying it to a diverse variety of cases that have been historically controversial, such as the genus Polypterus, the family Tarsiidae, and the vast (super)kingdom Protista.
Submitted 14 July, 2013;
originally announced July 2013.
-
3-color Bounded Patterned Self-assembly
Authors:
Lila Kari,
Steffen Kopecki,
Shinnosuke Seki
Abstract:
Patterned self-assembly tile set synthesis (PATS) is the problem of finding a minimal tile set which uniquely self-assembles into a given pattern. Czeizler and Popa proved the NP-completeness of PATS and Seki showed that the PATS problem is already NP-complete for patterns with 60 colors. In search of the minimal number of colors such that PATS remains NP-complete, we introduce multiple bound PATS (mbPATS) where we allow bounds for the numbers of tile types of each color. We show that mbPATS is NP-complete for patterns with just three colors and, as a byproduct of this result, we also obtain a novel proof for the NP-completeness of PATS which is more concise than the previous proofs.
Submitted 13 June, 2013;
originally announced June 2013.
-
Hypergraph Automata: A Theoretical Model for Patterned Self-assembly
Authors:
Lila Kari,
Steffen Kopecki,
Amirhossein Simjour
Abstract:
Patterned self-assembly is a process whereby coloured tiles self-assemble to build a rectangular coloured pattern. We propose self-assembly (SA) hypergraph automata as an automata-theoretic model for patterned self-assembly. We investigate the computational power of SA-hypergraph automata and show that for every recognizable picture language, there exists an SA-hypergraph automaton that accepts this language. Conversely, we prove that for any restricted SA-hypergraph automaton, there exists a Wang Tile System, a model for recognizable picture languages, that accepts the same language. The advantage of SA-hypergraph automata over Wang automata, acceptors for the class of recognizable picture languages, is that they do not rely on an a priori defined scanning strategy.
Submitted 12 February, 2013;
originally announced February 2013.
-
Deciding Whether a Regular Language is Generated by a Splicing System
Authors:
Lila Kari,
Steffen Kopecki
Abstract:
Splicing as a binary word/language operation is inspired by the DNA recombination under the action of restriction enzymes and ligases, and was first introduced by Tom Head in 1987. Shortly thereafter, it was proven that the languages generated by (finite) splicing systems form a proper subclass of the class of regular languages. However, the question of whether or not one can decide if a given regular language is generated by a splicing system remained open. In this paper we give a positive answer to this question. Namely, we prove that, if a language is generated by a splicing system, then it is also generated by a splicing system whose size is a function of the size of the syntactic monoid of the input language, and which can be effectively constructed.
Submitted 30 August, 2012; v1 submitted 20 December, 2011;
originally announced December 2011.
-
Iterated Hairpin Completions of Non-crossing Words
Authors:
Lila Kari,
Steffen Kopecki,
Shinnosuke Seki
Abstract:
Iterated hairpin completion is an operation on formal languages that is inspired by the hairpin formation in DNA biochemistry. Iterated hairpin completion of a word (or more precisely a singleton language) is always a context-sensitive language and for some words it is known to be non-context-free. However, it is unknown whether regularity of iterated hairpin completion of a given word is decidable. The question of whether iterated hairpin completion of a word can be context-free but not regular has also been asked in the literature. In this paper we investigate iterated hairpin completions of non-crossing words and, within this setting, we are able to answer both questions. For non-crossing words we prove that the regularity of iterated hairpin completions is decidable and that if iterated hairpin completion of a non-crossing word is not regular, then it is not context-free either.
Submitted 4 October, 2011;
originally announced October 2011.
-
On the regularity of iterated hairpin completion of a single word
Authors:
Lila Kari,
Steffen Kopecki,
Shinnosuke Seki
Abstract:
Hairpin completion is an abstract operation modeling a DNA bio-operation which receives as input a DNA strand $w = x α y \bar{α}$, and outputs $w' = x α y \bar{α} \bar{x}$, where $\bar{x}$ denotes the Watson-Crick complement of $x$. In this paper, we focus on the problem of finding conditions under which the iterated hairpin completion of a given word is regular. According to the numbers of words $α$ and $\bar{α}$ that initiate hairpin completion and how they are scattered, we classify the set of all words $w$. For some basic classes of words $w$ containing small numbers of occurrences of $α$ and $\bar{α}$, we prove that the iterated hairpin completion of $w$ is regular. For other classes with higher numbers of occurrences of $α$ and $\bar{α}$, we prove a necessary and sufficient condition for the iterated hairpin completion of a word in these classes to be regular.
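A single step of the hairpin completion described in this abstract can be sketched in Python as follows. Two assumptions are made that the abstract does not state: the Watson-Crick complement is taken to be the reverse complement over {A, C, G, T}, and the word $α$ has a fixed length `k` supplied by the caller. The function names are hypothetical, and only the right-sided, one-step operation is shown, not the iterated version studied in the paper.

```python
from typing import Set

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def bar(word: str) -> str:
    """Reverse complement of a DNA word (assumed meaning of the bar operator)."""
    return "".join(_COMPLEMENT[base] for base in reversed(word))


def right_hairpin_completions(w: str, k: int) -> Set[str]:
    """All w' = x a y bar(a) bar(x) for factorizations w = x a y bar(a), |a| = k."""
    if k <= 0 or 2 * k > len(w):
        return set()
    alpha = bar(w[-k:])  # the suffix of w plays the role of bar(a)
    body = w[:-k]        # x a y, where a must occur without overlapping the suffix
    results: Set[str] = set()
    for i in range(len(body) - k + 1):
        if body.startswith(alpha, i):
            x = w[:i]
            results.add(w + bar(x))  # extend the strand by bar(x)
    return results


if __name__ == "__main__":
    # x = "TT", a = "ACG", y = "AAA", bar(a) = "CGT"
    print(right_hairpin_completions("TTACGAAACGT", 3))  # {'TTACGAAACGTAA'}
```

The suffix binds to the earlier occurrence of its complement, and the strand is then extended by the complement of the unpaired prefix, which is what appending `bar(x)` models.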
Submitted 13 April, 2011;
originally announced April 2011.
-
Ciliate Gene Unscrambling with Fewer Templates
Authors:
Lila Kari,
Afroza Rahman
Abstract:
One of the theoretical models proposed for the mechanism of gene unscrambling in some species of ciliates is the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg which has been generalized by Daley and McQuillan from a formal language theory perspective. In this paper, we propose a refinement of this model that generates regular languages using the iterated TGR system with a finite initial language and a finite set of templates, using fewer templates and a smaller alphabet compared to that of the Daley-McQuillan model. To achieve Turing completeness using only finite components, i.e., a finite initial language and a finite set of templates, we also propose an extension of the contextual template-guided recombination system (CTGR system) by Daley and McQuillan, by adding an extra control called permitting contexts on the usage of templates.
Submitted 10 August, 2010;
originally announced August 2010.
-
State Complexity of Catenation Combined with Star and Reversal
Authors:
Bo Cui,
Yuan Gao,
Lila Kari,
Sheng Yu
Abstract:
This paper is a continuation of our research work on state complexity of combined operations. Motivated by applications, we study the state complexities of two particular combined operations: catenation combined with star and catenation combined with reversal. We show that the state complexities of both of these combined operations are considerably less than the compositions of the state complexities of their individual participating operations.
Submitted 10 August, 2010;
originally announced August 2010.
-
State Complexity of Two Combined Operations: Reversal-Catenation and Star-Catenation
Authors:
Bo Cui,
Yuan Gao,
Lila Kari,
Sheng Yu
Abstract:
In this paper, we show that, due to the structural properties of the resulting automaton obtained from a prior operation, the state complexity of a combined operation may be close, but not equal, to the mathematical composition of the state complexities of its component operations. In particular, we provide two witness combined operations: reversal combined with catenation and star combined with catenation.
Submitted 23 June, 2010;
originally announced June 2010.
-
The Power of Nondeterminism in Self-Assembly
Authors:
Nathaniel Bryans,
Ehsan Chiniforooshan,
David Doty,
Lila Kari,
Shinnosuke Seki
Abstract:
We investigate the role of nondeterminism in Winfree's abstract Tile Assembly Model (aTAM), which was conceived to model artificial molecular self-assembling systems constructed from DNA. Of particular practical importance is to find tile systems that minimize resources such as the number of distinct tile types, each of which corresponds to a set of DNA strands that must be custom-synthesized in actual molecular implementations of the aTAM. We seek to identify to what extent the use of nondeterminism in tile systems affects the resources required by such molecular shape-building algorithms.
We first show a "molecular computability theoretic" result: there is an infinite shape S that is uniquely assembled by a tile system but not by any deterministic tile system. We then show an analogous phenomenon in the finitary "molecular complexity theoretic" case: there is a finite shape S that is uniquely assembled by a tile system with c tile types, but every deterministic tile system that uniquely assembles S has more than c tile types. In fact we extend the technique to derive a stronger (classical complexity theoretic) result, showing that the problem of finding the minimum number of tile types that uniquely assemble a given finite shape is Sigma-P-2-complete. In contrast, the problem of finding the minimum number of deterministic tile types that uniquely assemble a shape was shown to be NP-complete by Adleman, Cheng, Goel, Huang, Kempe, Moisset de Espanés, and Rothemund (Combinatorial Optimization Problems in Self-Assembly, STOC 2002).
The conclusion is that nondeterminism confers extra power to assemble a shape from a small tile system, but unless the polynomial hierarchy collapses, it is computationally more difficult to exploit this power by finding the size of the smallest tile system, compared to finding the size of the smallest deterministic tile system.
Submitted 25 November, 2010; v1 submitted 15 June, 2010;
originally announced June 2010.
-
Scalable, Time-Responsive, Digital, Energy-Efficient Molecular Circuits using DNA Strand Displacement
Authors:
Ehsan Chiniforooshan,
David Doty,
Lila Kari,
Shinnosuke Seki
Abstract:
We propose a novel theoretical biomolecular design to implement any Boolean circuit using the mechanism of DNA strand displacement. The design is scalable: all species of DNA strands can in principle be mixed and prepared in a single test tube, rather than requiring separate purification of each species, which is a barrier to large-scale synthesis. The design is time-responsive: the concentration of output species changes in response to the concentration of input species, so that time-varying inputs may be continuously processed. The design is digital: Boolean values of wires in the circuit are represented as high or low concentrations of certain species, and we show how to construct a single-input, single-output signal restoration gate that amplifies the difference between high and low, which can be distributed to each wire in the circuit to overcome signal degradation. This means we can achieve a digital abstraction of the analog values of concentrations. Finally, the design is energy-efficient: if input species are specified ideally (meaning absolutely 0 concentration of unwanted species), then output species converge to their ideal concentrations at steady-state, and the system at steady-state is in (dynamic) equilibrium, meaning that no energy is consumed by irreversible reactions until the input again changes.
Drawbacks of our design include the following. If input is provided non-ideally (small positive concentration of unwanted species), then energy must be continually expended to maintain correct output concentrations even at steady-state. In addition, our fuel species - those species that are permanently consumed in irreversible reactions - are not "generic"; each gate in the circuit is powered by its own specific type of fuel species. Hence different circuits must be powered by different types of fuel. Finally, we require input to be given according to the dual-rail convention, so that an input of 0 is specified not only by the absence of a certain species, but by the presence of another. That is, we do not construct a "true NOT gate" that sets its output to high concentration if and only if its input's concentration is low. It remains an open problem to design scalable, time-responsive, digital, energy-efficient molecular circuits that additionally solve one of these problems, or to prove that some subset of their resolutions are mutually incompatible.
Submitted 18 March, 2010; v1 submitted 16 March, 2010;
originally announced March 2010.
-
Triangular Self-Assembly
Authors:
Lila Kari,
Shinnosuke Seki,
Zhi Xu
Abstract:
We discuss self-assembly systems of triangular tiles instead of square tiles, in particular right triangular tiles and equilateral triangular tiles. We show that the triangular tile assembly system, either deterministic or non-deterministic, has the same computational power as the square tile assembly system, which is Turing universal. By providing counter-examples, we show that the triangular tile assembly system and the square tile assembly system are not comparable in general. More precisely, there exists a square tile assembly system S such that no triangular tile assembly system is a division of S and produces the same shape; there exists a triangular tile assembly system T such that no square tile assembly system produces the same compatible shape with border glues. We also discuss the assembly of triangles by triangular tiles and obtain results similar to the assembly of squares, that is, to assemble a triangle of size O(N^2), the minimal number of tiles required is in O(log N/log log N).
Submitted 26 February, 2010;
originally announced February 2010.
-
Properties of Pseudo-Primitive Words and their Applications
Authors:
Lila Kari,
Benoît Masson,
Shinnosuke Seki
Abstract:
A pseudo-primitive word with respect to an antimorphic involution θ is a word which cannot be written as a catenation of occurrences of a strictly shorter word t and θ(t). Properties of pseudo-primitive words are investigated in this paper. These properties link pseudo-primitive words with essential notions in combinatorics on words such as primitive words, (pseudo)-palindromes, and (pseudo)-commutativity. Their applications include an improved solution to the extended Lyndon-Schützenberger equation u_1 u_2 ... u_l = v_1 ... v_n w_1 ... w_m, where u_1, ..., u_l \in {u, θ(u)}, v_1, ..., v_n \in {v, θ(v)}, and w_1, ..., w_m \in {w, θ(w)} for some words u, v, w, integers l, n, m \ge 2, and an antimorphic involution θ. We prove that for l \ge 4, n,m \ge 3, this equation implies that u, v, w can be expressed in terms of a common word t and its image θ(t). Moreover, several cases of this equation where l = 3 are examined.
Submitted 22 February, 2010;
originally announced February 2010.
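The definition of a pseudo-primitive word in the abstract above can be illustrated with a short brute-force check. This is only a sketch under stated assumptions: θ is taken to be the DNA reverse complement (one example of an antimorphic involution), and the names `theta` and `is_pseudo_primitive` are hypothetical helpers, not taken from the paper.

```python
# Assumption: theta is the Watson-Crick reverse complement over {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Reverse the word and complement each letter (antimorphic involution)."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


def is_pseudo_primitive(word: str) -> bool:
    """Return True if word is NOT a catenation of copies of a strictly
    shorter word t and theta(t). Brute force over candidate block lengths."""
    if not word:
        raise ValueError("word must be non-empty")
    n = len(word)
    for d in range(1, n):  # |t| must be strictly smaller than |word|
        if n % d:
            continue  # t and theta(t) have equal length, so d must divide n
        # The first block is either t or theta(t); since theta is an
        # involution, taking it as t loses no generality.
        t = word[:d]
        t_img = theta(t)
        if all(word[i:i + d] in (t, t_img) for i in range(0, n, d)):
            return False
    return True
```

For instance, under this choice of θ, "ACGT" equals "AC" followed by θ("AC") = "GT", so it is not pseudo-primitive, whereas "AAC" admits no such decomposition.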
-
Polyominoes Simulating Arbitrary-Neighborhood Zippers and Tilings
Authors:
Lila Kari,
Benoît Masson
Abstract:
This paper provides a bridge between the classical tiling theory and the complex neighborhood self-assembling situations that exist in practice. The neighborhood of a position in the plane is the set of coordinates which are considered adjacent to it. This includes classical neighborhoods of size four, as well as arbitrarily complex neighborhoods. A generalized tile system consists of a set of tiles, a neighborhood, and a relation which dictates which are the "admissible" neighboring tiles of a given tile. Thus, in correctly formed assemblies, tiles are assigned positions of the plane in accordance with this relation. We prove that any validly tiled path defined in a given but arbitrary neighborhood (a zipper) can be simulated by a simple "ribbon" of microtiles. A ribbon is a special kind of polyomino, consisting of a non-self-crossing sequence of tiles on the plane, in which successive tiles stick along their adjacent edge. Finally, we extend this construction to the case of traditional tilings, proving that we can simulate arbitrary-neighborhood tilings by simple-neighborhood tilings, while preserving some of their essential properties.
Submitted 11 April, 2011; v1 submitted 19 February, 2010;
originally announced February 2010.
-
Negative Interactions in Irreversible Self-Assembly
Authors:
David Doty,
Lila Kari,
Benoit Masson
Abstract:
This paper explores the use of negative (i.e., repulsive) interactions in the abstract Tile Assembly Model defined by Winfree. Winfree postulated negative interactions to be physically plausible in his Ph.D. thesis, and Reif, Sahu, and Yin explored their power in the context of reversible attachment operations. We explore the power of negative interactions with irreversible attachments, and we achieve two main results. Our first result is an impossibility theorem: after t steps of assembly, Omega(t) tiles will be forever bound to an assembly, unable to detach. Thus negative glue strengths do not afford unlimited power to reuse tiles. Our second result is a positive one: we construct a set of tiles that can simulate a Turing machine with space bound s and time bound t, while ensuring that no intermediate assembly grows larger than O(s), rather than O(s * t) as required by the standard Turing machine simulation with tiles.
Submitted 13 February, 2010;
originally announced February 2010.
-
Pseudo-Power Avoidance
Authors:
Ehsan Chiniforooshan,
Lila Kari,
Zhi Xu
Abstract:
Repetition avoidance has been studied since Thue's work. In this paper, we consider another type of repetition, which is called pseudo-power. This concept is inspired by Watson-Crick complementarity in DNA sequences and is defined over an antimorphic involution $φ$. We first classify the alphabet $Σ$ and the antimorphic involution $φ$, under which there exist sufficiently long pseudo-$k$th-power-free words. Then we present algorithms to test whether a finite word $w$ is pseudo-$k$th-power-free.
Submitted 11 November, 2009;
originally announced November 2009.
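The test mentioned in the pseudo-power abstract above can be sketched as a naive search. The assumptions here are loud ones: a pseudo-$k$th-power is taken to be a word u_1 u_2 ... u_k with every u_i in {u, φ(u)} for some non-empty u, φ is the DNA reverse complement, and the brute-force loop below is an illustration only, not the algorithms presented in the paper. The names `phi` and `is_pseudo_kth_power_free` are hypothetical.

```python
# Assumption: phi is the Watson-Crick reverse complement over {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def phi(word: str) -> str:
    """Reverse the word and complement each letter (antimorphic involution)."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


def is_pseudo_kth_power_free(word: str, k: int) -> bool:
    """Return True if no factor of word has the form u_1 ... u_k with each
    u_i equal to u or phi(u) for a non-empty u (assumed definition).
    Naive check over every block length and every starting position."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(word)
    for d in range(1, n // k + 1):          # candidate length of u
        for i in range(n - k * d + 1):      # candidate start of the factor
            u = word[i:i + d]
            u_img = phi(u)
            blocks = (word[i + j * d:i + (j + 1) * d] for j in range(k))
            if all(b in (u, u_img) for b in blocks):
                return False
    return True
```

Under these assumptions, "ACGT" is not pseudo-square-free because its factor "CG" is C followed by φ(C), while "ACA" contains no pseudo-square.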