-
Crafting Interpretable Embeddings by Asking LLMs Questions
Authors:
Vinamra Benara,
Chandan Singh,
John X. Morris,
Richard Antonello,
Ion Stoica,
Alexander G. Huth,
Jianfeng Gao
Abstract:
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. However, their opaqueness and proliferation into scientific domains such as neuroscience have created a growing need for interpretability. Here, we ask whether we can obtain interpretable embeddings through LLM prompting. We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. Training QA-Emb reduces to selecting a set of underlying questions rather than learning model weights.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli. QA-Emb significantly outperforms an established interpretable baseline, and does so while requiring very few questions. This paves the way towards building flexible feature spaces that can concretize and evaluate our understanding of semantic brain representations. We additionally find that QA-Emb can be effectively approximated with an efficient model, and we explore broader applications in simple NLP tasks.
Submitted 26 May, 2024;
originally announced May 2024.
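The core idea in the abstract above, an embedding whose features are answers to yes/no questions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `qa_embed`, the `TOY_RULES` table, and the keyword-based `toy_answer` stand-in for an LLM are all hypothetical names introduced here for illustration. In practice the answering function would prompt an LLM.

```python
from typing import Callable, List

def qa_embed(text: str,
             questions: List[str],
             answer_fn: Callable[[str, str], bool]) -> List[int]:
    """Return one binary feature per question: 1 for "yes", 0 for "no"."""
    return [1 if answer_fn(question, text) else 0 for question in questions]

# Hypothetical stand-in for an LLM: answers a question by keyword lookup.
# A real system would send the question and the text to a language model.
TOY_RULES = {
    "Does the input mention time?": ("yesterday", "today", "hour"),
    "Does the input mention a place?": ("kitchen", "city", "school"),
}

def toy_answer(question: str, text: str) -> bool:
    words = text.lower().split()
    return any(keyword in words for keyword in TOY_RULES[question])

questions = list(TOY_RULES)
embedding = qa_embed("We met in the city yesterday", questions, toy_answer)
```

Because each coordinate corresponds to a named question, changing the embedding amounts to changing the list of questions, which matches the abstract's remark that training reduces to question selection rather than weight learning.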
-
Faster and More Accurate Sequence Alignment with SNAP
Authors:
Matei Zaharia,
William J. Bolosky,
Kristal Curtis,
Armando Fox,
David Patterson,
Scott Shenker,
Ion Stoica,
Richard M. Karp,
Taylor Sittler
Abstract:
We present the Scalable Nucleotide Alignment Program (SNAP), a new short and long read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10-100x faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST's. However, SNAP greatly reduces the number and cost of local alignment checks performed through several measures: it uses longer seeds to reduce the false positive locations considered, leverages larger memory capacities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today's fast aligners ignore. We calculate that SNAP can align a dataset with 30x coverage of a human genome in less than an hour for a cost of $2 on Amazon EC2, with higher accuracy than BWA. Finally, we describe ongoing work to further improve SNAP.
Submitted 23 November, 2011;
originally announced November 2011.
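The general seed-and-verify pattern mentioned in the abstract above (a hash index of short seeds, followed by an edit-distance check that gives up early on poor candidates) can be sketched as follows. This is a simplified illustration, not SNAP's actual algorithm or code: the function names, the non-overlapping seed choice, and the comparison against a fixed-length genome window are all assumptions made for brevity.

```python
from collections import defaultdict

def build_seed_index(genome, seed_len):
    """Map every length-seed_len substring of the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - seed_len + 1):
        index[genome[i:i + seed_len]].append(i)
    return index

def bounded_edit_distance(a, b, limit):
    """Return the edit distance of a and b if it is <= limit, else None.

    The row minimum of the DP table never decreases, so the computation
    stops as soon as a whole row exceeds the limit.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:
            return None  # candidate rejected without finishing the table
        prev = cur
    return prev[-1] if prev[-1] <= limit else None

def align(read, genome, index, seed_len, max_dist):
    """Return (start, distance) of the best candidate location, or None."""
    candidates = set()
    # Assumption: sample non-overlapping seeds from the read.
    for off in range(0, len(read) - seed_len + 1, seed_len):
        for pos in index.get(read[off:off + seed_len], ()):
            start = pos - off
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = genome[start:start + len(read)]
        dist = bounded_edit_distance(read, window, max_dist)
        if dist is not None and (best is None or dist < best[1]):
            best = (start, dist)
    return best

genome = "ACGTACGTTTGACCA"
index = build_seed_index(genome, seed_len=3)
hit = align("TTGAGC", genome, index, seed_len=3, max_dist=2)
```

Longer seeds make each index lookup return fewer candidate positions, which is the trade-off the abstract points to; the early exit in `bounded_edit_distance` is one simple way to avoid fully scoring candidates that cannot meet the distance limit.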
-
Structure calculation strategies for helical membrane proteins; a comparison study
Authors:
Ileana Stoica
Abstract:
Structure predictions of helical membrane proteins have been designed to take advantage of the structural autonomy of secondary structure elements, as postulated by the two-stage model of Engelman and Popot. In this context, we investigate structure calculation strategies for two membrane proteins with different functions, sizes, amino acid compositions, and topologies: the glycophorin A homodimer (a paradigm for close inter-helical packing in membrane proteins) and aquaporin (a channel protein). Our structure calculations are based on two alternative folding schemes: a one-step simulated annealing from an extended chain conformation, and a two-step procedure inspired by the grid-search methods traditionally used in membrane protein predictions. In this framework, we investigate rationales for using sparse NMR data, such as distance-based restraints and residual dipolar couplings, in structure calculations of helical membrane proteins.
Submitted 21 August, 2005; v1 submitted 6 May, 2005;
originally announced May 2005.