-
Flash Invariant Point Attention
Authors:
Andrew Liu,
Axel Elaldi,
Nicholas T Franklin,
Nathan Russell,
Gurinder S Atwal,
Yih-En A Ban,
Olivia Viessmann
Abstract:
Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.
Submitted 16 May, 2025;
originally announced May 2025.
-
Learning quantitative sequence-function relationships from massively parallel experiments
Authors:
Gurinder S. Atwal,
Justin B. Kinney
Abstract:
A fundamental aspect of biological information processing is the ubiquity of sequence-function relationships -- functions that map the sequence of DNA, RNA, or protein to a biochemically relevant activity. Most sequence-function relationships in biology are quantitative, but only recently have experimental techniques for effectively measuring these relationships been developed. The advent of such "massively parallel" experiments presents an exciting opportunity for the concepts and methods of statistical physics to inform the study of biological systems. After reviewing these recent experimental advances, we focus on the problem of how to infer parametric models of sequence-function relationships from the data produced by these experiments. Specifically, we retrace and extend recent theoretical work showing that inference based on mutual information, not the standard likelihood-based approach, is often necessary for accurately learning the parameters of these models. Closely connected with this result is the emergence of "diffeomorphic modes" -- directions in parameter space that are far less constrained by data than likelihood-based inference would suggest. Analogous to Goldstone modes in physics, diffeomorphic modes arise from an arbitrarily broken symmetry of the inference problem. An analytically tractable model of a massively parallel experiment is then described, providing an explicit demonstration of these fundamental aspects of statistical inference. This paper concludes with an outlook on the theoretical and computational challenges currently facing studies of quantitative sequence-function relationships.
Submitted 22 September, 2015; v1 submitted 29 May, 2015;
originally announced June 2015.
-
Equitability, mutual information, and the maximal information coefficient
Authors:
Justin B. Kinney,
Gurinder S. Atwal
Abstract:
Reshef et al. recently proposed a new statistical measure, the "maximal information coefficient" (MIC), for quantifying arbitrary dependencies between pairs of stochastic quantities. MIC is based on mutual information, a fundamental quantity in information theory that is widely understood to serve this need. MIC, however, is not an estimate of mutual information. Indeed, it was claimed that MIC possesses a desirable mathematical property called "equitability" that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect. First we offer mathematical proof that no (non-trivial) dependence measure satisfies the definition of equitability proposed by Reshef et al. We then propose a self-consistent and more general definition of equitability that follows naturally from the Data Processing Inequality. Mutual information satisfies this new definition of equitability while MIC does not. Finally, we show that the simulation evidence offered by Reshef et al. was artifactual. We conclude that estimating mutual information is not only practical for many real-world applications, but also provides a natural solution to the problem of quantifying associations in large data sets.
Submitted 31 January, 2013;
originally announced January 2013.
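The Data Processing Inequality invoked in the abstract above can be illustrated numerically. The following is a minimal sketch, not code from the paper: the toy joint distribution and the helper names `mutual_information` and `push_forward` are assumptions made for illustration. It shows that mutual information is unchanged when one variable is relabelled invertibly, and cannot increase when that variable is coarsened.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """Mutual information, in bits, of a joint distribution {(x, y): p}."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

def push_forward(joint, f):
    """Joint distribution of (x, f(y)), obtained by applying f to y."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)

# Hypothetical toy data: x is uniform on {0, 1, 2, 3} and y is a noisy copy of x
# (y equals x with probability 0.7, each other value with probability 0.1).
joint = {}
for x in range(4):
    for y in range(4):
        joint[(x, y)] = (0.7 if x == y else 0.1) / 4

mi_original = mutual_information(joint)
# Invertible relabelling of y: no information is lost.
mi_relabelled = mutual_information(push_forward(joint, lambda y: 3 - y))
# Lossy coarsening of y (4 values merged into 2): information can only drop.
mi_coarsened = mutual_information(push_forward(joint, lambda y: y // 2))

print(f"I[X;Y]      = {mi_original:.4f} bits")
print(f"I[X;3-Y]    = {mi_relabelled:.4f} bits")
print(f"I[X;Y // 2] = {mi_coarsened:.4f} bits")
```

Working with an exact joint distribution, rather than samples, keeps the comparison free of estimation error; estimating mutual information from finite data is a separate problem that this sketch does not address.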
-
Kerfuffle: a web tool for multi-species gene colocalization analysis
Authors:
Robert Aboukhalil,
Bernard Fendler,
Gurinder S. Atwal
Abstract:
The evolutionary pressures that underlie the large-scale functional organization of the genome are not well understood in eukaryotes. Recent evidence suggests that functionally similar genes may colocalize (cluster) in the eukaryotic genome, suggesting the role of chromatin-level gene regulation in shaping the physical distribution of coordinated genes. However, few of the bioinformatic tools currently available allow for a systematic study of gene colocalization across several, evolutionarily distant species. Kerfuffle is a web tool designed to help discover, visualize, and quantify the physical organization of genomes by identifying significant gene colocalization and conservation across the assembled genomes of available species (currently up to 47, from humans to worms). Kerfuffle only requires the user to specify a list of human genes and the names of other species of interest. Without further input from the user, the software queries the e!Ensembl BioMart server to obtain positional information and discovers homology relations in all genes and species specified. Using this information, Kerfuffle performs a multi-species clustering analysis, presents downloadable lists of clustered genes, performs Monte Carlo statistical significance calculations, estimates how conserved gene clusters are across species, plots histograms and interactive graphs, allows users to save their queries, and generates a downloadable visualization of the clusters using the Circos software. These analyses may be used to further explore the functional roles of gene clusters by interrogating the enriched molecular pathways associated with each cluster.
Submitted 14 January, 2013;
originally announced January 2013.
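The general idea behind a Monte Carlo significance calculation for gene colocalization can be sketched as follows. This is an illustrative assumption, not Kerfuffle's actual algorithm: the clustering statistic (`count_close_pairs`), the null model (random gene sets of the same size drawn from the same chromosome), and all positions are hypothetical.

```python
import random

def count_close_pairs(positions, window):
    """Count neighbouring genes (in sorted order) at most `window` bp apart."""
    ordered = sorted(positions)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= window)

def monte_carlo_p_value(all_positions, gene_set, window, n_trials=2000, seed=0):
    """Empirical p-value: how often a random gene set clusters at least as much."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = count_close_pairs(gene_set, window)
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(all_positions, len(gene_set))
        if count_close_pairs(sample, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_trials + 1)

# Hypothetical chromosome: 500 genes spaced 10 kb apart.
all_positions = [i * 10_000 for i in range(500)]
# Hypothetical query: 6 adjacent genes plus 4 scattered ones.
gene_set = all_positions[100:106] + [all_positions[i] for i in (10, 250, 380, 470)]

observed, p_value = monte_carlo_p_value(all_positions, gene_set, window=10_000)
print(f"close pairs observed: {observed}, empirical p-value: {p_value:.4f}")
```

Sampling from the set of real gene positions, rather than from uniform coordinates, keeps the null model sensitive to how unevenly genes are distributed along a chromosome.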
-
Parametric inference in the large data limit using maximally informative models
Authors:
Justin B. Kinney,
Gurinder S. Atwal
Abstract:
Motivated by data-rich experiments in transcriptional regulation and sensory neuroscience, we consider the following general problem in statistical inference. When exposed to a high-dimensional signal S, a system of interest computes a representation R of that signal which is then observed through a noisy measurement M. From a large number of signals and measurements, we wish to infer the "filter" that maps S to R. However, the standard method for solving such problems, likelihood-based inference, requires perfect a priori knowledge of the "noise function" mapping R to M. In practice such noise functions are usually known only approximately, if at all, and using an incorrect noise function will typically bias the inferred filter. Here we show that, in the large data limit, this need for a pre-characterized noise function can be circumvented by searching for filters that instead maximize the mutual information I[M;R] between observed measurements and predicted representations. Moreover, if the correct filter lies within the space of filters being explored, maximizing mutual information becomes equivalent to simultaneously maximizing every dependence measure that satisfies the Data Processing Inequality. It is important to note that maximizing mutual information will typically leave a small number of directions in parameter space unconstrained. We term these directions "diffeomorphic modes" and present an equation that allows these modes to be derived systematically. The presence of diffeomorphic modes reflects a fundamental and nontrivial substructure within parameter space, one that is obscured by standard likelihood-based inference.
Submitted 13 December, 2013; v1 submitted 14 December, 2012;
originally announced December 2012.
-
Utilizing RNA-Seq Data for Cancer Network Inference
Authors:
Ying Cai,
Bernard Fendler,
Gurinder S. Atwal
Abstract:
An important challenge in cancer systems biology is to uncover the complex network of interactions between genes (tumor suppressor genes and oncogenes) implicated in cancer. Next generation sequencing provides unparalleled ability to probe the expression levels of the entire set of cancer genes and their transcript isoforms. However, there are onerous statistical and computational issues in interpreting high-dimensional sequencing data and inferring the underlying genetic network. In this study, we analyzed RNA-Seq data from lymphoblastoid cell lines derived from a population of 69 human individuals and implemented a probabilistic framework to construct biologically-relevant genetic networks. In particular, we employed a graphical lasso analysis, motivated by considerations of the maximum entropy formalism, to estimate the sparse inverse covariance matrix of RNA-Seq data. Gene ontology, pathway enrichment and protein-protein path length analysis were all carried out to validate the biological context of the predicted network of interacting cancer gene isoforms.
Submitted 6 December, 2012; v1 submitted 19 November, 2012;
originally announced November 2012.
-
Ambiguous model learning made unambiguous with 1/f priors
Authors:
Gurinder Singh Atwal,
William Bialek
Abstract:
What happens to the optimal interpretation of noisy data when there exists more than one equally plausible interpretation of the data? In a Bayesian model-learning framework the answer depends on the prior expectations of the dynamics of the model parameter that is to be inferred from the data. Local time constraints on the priors are insufficient to pick one interpretation over another. On the other hand, nonlocal time constraints, induced by a $1/f$ noise spectrum of the priors, are shown to permit learning of a specific model parameter even when there are infinitely many equally plausible interpretations of the data. This transition is inferred by a remarkable mapping of the model estimation problem to a dissipative physical system, allowing the use of powerful statistical mechanical methods to uncover the transition from indeterminate to determinate model learning.
Submitted 23 December, 2005;
originally announced December 2005.
-
Information based clustering
Authors:
Noam Slonim,
Gurinder Singh Atwal,
Gasper Tkacik,
William Bialek
Abstract:
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "prototype", does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures non-linear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.
Submitted 25 November, 2005;
originally announced November 2005.
-
Information based clustering: Supplementary material
Authors:
Noam Slonim,
Gurinder Singh Atwal,
Gasper Tkacik,
William Bialek
Abstract:
This technical report provides the supplementary material for a paper entitled "Information based clustering", to appear shortly in Proceedings of the National Academy of Sciences (USA). In Section I we present in detail the iterative clustering algorithm used in our experiments and in Section II we describe the validation scheme used to determine the statistical significance of our results. Then in subsequent sections we provide all the experimental results for three very different applications: the response of gene expression in yeast to different forms of environmental stress, the dynamics of stock prices in the Standard and Poor's 500, and viewer ratings of popular movies. In particular, we highlight some of the results that seem to deserve special attention. All the experimental results and relevant code, including a freely available web application, can be found at http://www.genomics.princeton.edu/biophysics-theory .
Submitted 25 November, 2005; v1 submitted 25 November, 2005;
originally announced November 2005.