-
Clio: Privacy-Preserving Insights into Real-World AI Use
Authors:
Alex Tamkin,
Miles McCain,
Kunal Handa,
Esin Durmus,
Liane Lovitt,
Ankur Rathi,
Saffron Huang,
Alfred Mountfield,
Jerry Hong,
Stuart Ritchie,
Michael Stern,
Brian Clarke,
Landon Goldberg,
Theodore R. Sumers,
Jared Mueller,
William McEachen,
Wes Mitchell,
Shan Carter,
Jack Clark,
Jared Kaplan,
Deep Ganguli
Abstract:
How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregate…
▽ More
How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. We validate this can be done with a high degree of accuracy and privacy by conducting extensive evaluations. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from providing advice on hairstyles to providing guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Mixture of neural fields for heterogeneous reconstruction in cryo-EM
Authors:
Axel Levy,
Rishwanth Raghu,
David Shustin,
Adele Rui-Yang Peng,
Huan Li,
Oliver Biggs Clarke,
Gordon Wetzstein,
Ellen D. Zhong
Abstract:
Cryo-electron microscopy (cryo-EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near-physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particula…
▽ More
Cryo-electron microscopy (cryo-EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near-physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particular, datasets containing mixtures of multiple proteins require the joint inference of structure, pose, compositional class, and conformational states for 3D reconstruction. Here, we present Hydra, an approach that models both conformational and compositional heterogeneity fully ab initio by parameterizing structures as arising from one of K neural fields. We employ a new likelihood-based loss function and demonstrate the effectiveness of our approach on synthetic datasets composed of mixtures of proteins with large degrees of conformational variability. We additionally demonstrate Hydra on an experimental dataset of a cellular lysate containing a mixture of different protein complexes. Hydra expands the expressivity of heterogeneous reconstruction methods and thus broadens the scope of cryo-EM to increasingly complex samples.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Point Prediction for Streaming Data
Authors:
Aleena Chanda,
N. V. Vinodchandran,
Bertrand Clarke
Abstract:
We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem cl…
▽ More
We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consists of i.i.d samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent.
We compare our new methods with two established predictors in terms of cumulative $L^1$ error. One is based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting and the other is based on Dirichlet process priors. These comparisons are for two cases. The first is one-pass meaning that the updating of the predictors is done using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming $K$-means to give a representative subset of fixed size that can be updated as data accumulate.
Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods for sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well probably because we were only using the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an M-open data generator.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
Decentralised Active Perception in Continuous Action Spaces for the Coordinated Escort Problem
Authors:
Rhett Hull,
Ki Myung Brian Lee,
Jennifer Wakulicz,
Chanyeol Yoo,
James McMahon,
Bryan Clarke,
Stuart Anstee,
Jijoong Kim,
Robert Fitch
Abstract:
We consider the coordinated escort problem, where a decentralised team of supporting robots implicitly assist the mission of higher-value principal robots. The defining challenge is how to evaluate the effect of supporting robots' actions on the principal robots' mission. To capture this effect, we define two novel auxiliary reward functions for supporting robots called satisfaction improvement an…
▽ More
We consider the coordinated escort problem, where a decentralised team of supporting robots implicitly assist the mission of higher-value principal robots. The defining challenge is how to evaluate the effect of supporting robots' actions on the principal robots' mission. To capture this effect, we define two novel auxiliary reward functions for supporting robots called satisfaction improvement and satisfaction entropy, which computes the improvement in probability of mission success, or the uncertainty thereof. Given these reward functions, we coordinate the entire team of principal and supporting robots using decentralised cross entropy method (Dec-CEM), a new extension of CEM to multi-agent systems based on the product distribution approximation. In a simulated object avoidance scenario, our planning framework demonstrates up to two-fold improvement in task satisfaction against conventional decoupled information gathering.The significance of our results is to introduce a new family of algorithmic problems that will enable important new practical applications of heterogeneous multi-robot systems.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Profunctor Optics, a Categorical Update
Authors:
Bryce Clarke,
Derek Elkins,
Jeremy Gibbons,
Fosco Loregian,
Bartosz Milewski,
Emily Pillmore,
Mario Román
Abstract:
Optics are bidirectional data accessors that capture data transformation patterns such as accessing subfields or iterating over containers. Profunctor optics are a particular choice of representation supporting modularity, meaning that we can construct accessors for complex structures by combining simpler ones. Profunctor optics have previously been studied only in an unenriched and non-mixed sett…
▽ More
Optics are bidirectional data accessors that capture data transformation patterns such as accessing subfields or iterating over containers. Profunctor optics are a particular choice of representation supporting modularity, meaning that we can construct accessors for complex structures by combining simpler ones. Profunctor optics have previously been studied only in an unenriched and non-mixed setting, in which both directions of access are modelled in the same category. However, functional programming languages are arguably better described by enriched categories; and we have found that some structures in the literature are actually mixed optics, with access directions modelled in different categories. Our work generalizes a classic result by Pastro and Street on Tambara theory and uses it to describe mixed V-enriched profunctor optics and to endow them with V-category structure. We provide some original families of optics and derivations, including an elementary one for traversals. Finally, we discuss a Haskell implementation.
△ Less
Submitted 20 February, 2024; v1 submitted 21 January, 2020;
originally announced January 2020.
-
A General Hybrid Clustering Technique
Authors:
Saeid Amiri,
Bertrand Clarke,
Jennifer Clarke,
Hoyt A. Koepke
Abstract:
Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). They key steps are to find a $K$-means clustering using $K_\ell$ cluste…
▽ More
Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). They key steps are to find a $K$-means clustering using $K_\ell$ clusters where $K_\ell \gg K$ and then joins these small clusters by using single linkage clustering. The second stage stabilizes the result of stage one by reclustering via the `membership matrix' under Hamming distance to generate a dendrogram. The third stage is to cut the dendrogram to get $K^*$ clusters where $K^* \geq K$ and then prune back to $K$ to give a final clustering. A variant on our technique also gives a reasonable estimate for $K_T$, the true number of clusters.
We provide a series of arguments to justify the steps in the stages of our methods and we provide numerous examples involving real and simulated data to compare our technique with other related techniques.
△ Less
Submitted 5 March, 2015; v1 submitted 3 March, 2015;
originally announced March 2015.