Search | arXiv e-print repository

Hyperbolic Genome Embeddings

Authors: Raiyan R. Khan, Philippe Chlenski, Itsik Pe'er

Abstract: Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapp… ▽ More Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets--the Transposable Elements Benchmark--which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning. Our code and benchmark datasets are available at https://github.com/rrkhan/HGE. △ Less

Submitted 29 July, 2025; originally announced July 2025.

Comments: 30 pages, 16 figures, 10 tables. Camera-ready version for ICLR 2025

Journal ref: In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025). https://openreview.net/forum?id=NkGDNM8LB0

arXiv:2506.04360 [pdf, ps, other]

Even Faster Hyperbolic Random Forests: A Beltrami-Klein Wrapper Approach

Authors: Philippe Chlenski, Itsik Pe'er

Abstract: Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein… ▽ More Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite hyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase. Our implementation is available at https://github.com/pchlenski/hyperdt. △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: 15 pages, 4 figures, 2 tables

arXiv:2503.09576 [pdf, ps, other]

Manify: A Python Library for Learning Non-Euclidean Representations

Authors: Philippe Chlenski, Kaizhu Du, Dylan Satow, Raiyan R. Khan, Itsik Pe'er

Abstract: We present Manify, an open-source Python library for non-Euclidean representation learning. Leveraging manifold learning techniques, Manify provides tools for learning embeddings in (products of) non-Euclidean spaces, performing classification and regression with data that lives in such spaces, estimating the curvature of a manifold, and more. Manify aims to advance research and applications in ma… ▽ More We present Manify, an open-source Python library for non-Euclidean representation learning. Leveraging manifold learning techniques, Manify provides tools for learning embeddings in (products of) non-Euclidean spaces, performing classification and regression with data that lives in such spaces, estimating the curvature of a manifold, and more. Manify aims to advance research and applications in machine learning by offering a comprehensive suite of tools for manifold-based data analysis. Our source code, examples, and documentation are available at https://github.com/pchlenski/manify. △ Less

Submitted 16 July, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

Comments: 33 pages, 4 figures, 5 tables. Preprint

arXiv:2501.17965 [pdf, ps, other]

Variational Combinatorial Sequential Monte Carlo for Bayesian Phylogenetics in Hyperbolic Space

Authors: Alex Chen, Philipe Chlenski, Kenneth Munyuza, Antonio Khalil Moretti, Christian A. Naesseth, Itsik Pe'er

Abstract: Hyperbolic space naturally encodes hierarchical structures such as phylogenies (binary trees), where inward-bending geodesics reflect paths through least common ancestors, and the exponential growth of neighborhoods mirrors the super-exponential scaling of topologies. This scaling challenge limits the efficiency of Euclidean-based approximate inference methods. Motivated by the geometric connectio… ▽ More Hyperbolic space naturally encodes hierarchical structures such as phylogenies (binary trees), where inward-bending geodesics reflect paths through least common ancestors, and the exponential growth of neighborhoods mirrors the super-exponential scaling of topologies. This scaling challenge limits the efficiency of Euclidean-based approximate inference methods. Motivated by the geometric connections between trees and hyperbolic space, we develop novel hyperbolic extensions of two sequential search algorithms: Combinatorial and Nested Combinatorial Sequential Monte Carlo (\textsc{Csmc} and \textsc{Ncsmc}). Our approach introduces consistent and unbiased estimators, along with variational inference methods (\textsc{H-Vcsmc} and \textsc{H-Vncsmc}), which outperform their Euclidean counterparts. Empirical results demonstrate improved speed, scalability and performance in high-dimensional phylogenetic inference tasks. △ Less

Submitted 15 July, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

Comments: 24 pages, 10 figures

Journal ref: Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:2962-2970, 2025

arXiv:2410.13879 [pdf, ps, other]

Mixed-curvature decision trees and random forests

Authors: Philippe Chlenski, Quentin Chu, Raiyan R. Khan, Kaizhu Du, Antonio Khalil Moretti, Itsik Pe'er

Abstract: Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still… ▽ More Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation respects manifold geometry while preserving the algorithmic properties that make decision trees effective. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. In benchmarks on a diverse suite of 57 classification, regression, and link prediction tasks, our product RFs ranked first on 29 tasks and came in the top 2 for 41. This highlights the value of product RFs as straightforward yet powerful new tools for data analysis in product manifolds. Code for our method is available at https://github.com/pchlenski/manify. △ Less

Submitted 6 June, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: 30 pages, 12 figures, 13 tables. Camera-ready version for ICML 2025

arXiv:2406.11944 [pdf, other]

Transcoders Find Interpretable LLM Feature Circuits

Authors: Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Abstract: A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely m… ▽ More A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits/. △ Less

Submitted 6 November, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 29 pages, 6 figures, 4 tables, 2 algorithms. NeurIPS 2024

arXiv:2406.05227

Mixed-Curvature Decision Trees and Random Forests

Authors: Philippe Chlenski, Quentin Chu, Itsik Pe'er

Abstract: We extend decision tree and random forest algorithms to product space manifolds: Cartesian products of Euclidean, hyperspherical, and hyperbolic manifolds. Such spaces have extremely expressive geometries capable of representing many arrangements of distances with low metric distortion. To date, all classifiers for product spaces fit a single linear decision boundary, and no regressor has been des… ▽ More We extend decision tree and random forest algorithms to product space manifolds: Cartesian products of Euclidean, hyperspherical, and hyperbolic manifolds. Such spaces have extremely expressive geometries capable of representing many arrangements of distances with low metric distortion. To date, all classifiers for product spaces fit a single linear decision boundary, and no regressor has been described. Our method enables a simple, expressive method for classification and regression in product manifolds. We demonstrate the superior accuracy of our tool compared to Euclidean methods operating in the ambient space or the tangent plane of the manifold across a range of constant-curvature and product manifolds. Code for our implementation and experiments is available at https://github.com/pchlenski/embedders. △ Less

Submitted 7 May, 2025; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: This paper has been replaced by a newer version at arXiv:2410.13879

arXiv:2406.03242 [pdf, other]

Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics

Authors: Hanming Yang, Antonio Khalil Moretti, Sebastian Macaluso, Philippe Chlenski, Christian A. Naesseth, Itsik Pe'er

Abstract: Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a main problem in data analyses in collider physics. This intricate task deals with estimating the latent structure of a jet (binary tree) and involves parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approa… ▽ More Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a main problem in data analyses in collider physics. This intricate task deals with estimating the latent structure of a jet (binary tree) and involves parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks. △ Less

Submitted 30 December, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 21 pages, 9 figures

Journal ref: Transactions on Machine Learning Research 2024

arXiv:2310.13841 [pdf, other]

Fast hyperboloid decision tree algorithms

Authors: Philippe Chlenski, Ethan Turok, Antonio Moretti, Itsik Pe'er

Abstract: Hyperbolic geometry is gaining traction in machine learning for its effectiveness at capturing hierarchical structures in real-world data. Hyperbolic spaces, where neighborhoods grow exponentially, offer substantial advantages and consistently deliver state-of-the-art results across diverse applications. However, hyperbolic classifiers often grapple with computational challenges. Methods reliant o… ▽ More Hyperbolic geometry is gaining traction in machine learning for its effectiveness at capturing hierarchical structures in real-world data. Hyperbolic spaces, where neighborhoods grow exponentially, offer substantial advantages and consistently deliver state-of-the-art results across diverse applications. However, hyperbolic classifiers often grapple with computational challenges. Methods reliant on Riemannian optimization frequently exhibit sluggishness, stemming from the increased computational demands of operations on Riemannian manifolds. In response to these challenges, we present hyperDT, a novel extension of decision tree algorithms into hyperbolic space. Crucially, hyperDT eliminates the need for computationally intensive Riemannian optimization, numerically unstable exponential and logarithmic maps, or pairwise comparisons between points by leveraging inner products to adapt Euclidean decision tree algorithms to hyperbolic space. Our approach is conceptually straightforward and maintains constant-time decision complexity while mitigating the scalability issues inherent in high-dimensional Euclidean spaces. Building upon hyperDT we introduce hyperRF, a hyperbolic random forest model. Extensive benchmarking across diverse datasets underscores the superior performance of these models, providing a swift, precise, accurate, and user-friendly toolkit for hyperbolic data analysis. △ Less

Submitted 4 March, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

Journal ref: International Conference on Learning Representations (2024)

Showing 1–9 of 9 results for author: Chlenski, P