Search | arXiv e-print repository

Clio: Privacy-Preserving Insights into Real-World AI Use

Authors: Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, Deep Ganguli

Abstract: How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregate… ▽ More How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. We validate this can be done with a high degree of accuracy and privacy by conducting extensive evaluations. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from providing advice on hairstyles to providing guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance. △ Less

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.09420 [pdf, other]

Mixture of neural fields for heterogeneous reconstruction in cryo-EM

Authors: Axel Levy, Rishwanth Raghu, David Shustin, Adele Rui-Yang Peng, Huan Li, Oliver Biggs Clarke, Gordon Wetzstein, Ellen D. Zhong

Abstract: Cryo-electron microscopy (cryo-EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near-physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particula… ▽ More Cryo-electron microscopy (cryo-EM) is an experimental technique for protein structure determination that images an ensemble of macromolecules in near-physiological contexts. While recent advances enable the reconstruction of dynamic conformations of a single biomolecular complex, current methods do not adequately model samples with mixed conformational and compositional heterogeneity. In particular, datasets containing mixtures of multiple proteins require the joint inference of structure, pose, compositional class, and conformational states for 3D reconstruction. Here, we present Hydra, an approach that models both conformational and compositional heterogeneity fully ab initio by parameterizing structures as arising from one of K neural fields. We employ a new likelihood-based loss function and demonstrate the effectiveness of our approach on synthetic datasets composed of mixtures of proteins with large degrees of conformational variability. We additionally demonstrate Hydra on an experimental dataset of a cellular lysate containing a mixture of different protein complexes. Hydra expands the expressivity of heterogeneous reconstruction methods and thus broadens the scope of cryo-EM to increasingly complex samples. △ Less

Submitted 12 December, 2024; originally announced December 2024.

arXiv:2408.01318 [pdf, other]

Point Prediction for Streaming Data

Authors: Aleena Chanda, N. V. Vinodchandran, Bertrand Clarke

Abstract: We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem cl… ▽ More We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consists of i.i.d samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative $L^1$ error. One is based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting and the other is based on Dirichlet process priors. These comparisons are for two cases. The first is one-pass meaning that the updating of the predictors is done using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming $K$-means to give a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods for sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well probably because we were only using the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an M-open data generator. △ Less

Submitted 2 August, 2024; originally announced August 2024.

Comments: 42 pages, two figures

MSC Class: 35A01; 65L10; 65L12; 65L20; 65L70

arXiv:2305.01869 [pdf, other]

Decentralised Active Perception in Continuous Action Spaces for the Coordinated Escort Problem

Authors: Rhett Hull, Ki Myung Brian Lee, Jennifer Wakulicz, Chanyeol Yoo, James McMahon, Bryan Clarke, Stuart Anstee, Jijoong Kim, Robert Fitch

Abstract: We consider the coordinated escort problem, where a decentralised team of supporting robots implicitly assist the mission of higher-value principal robots. The defining challenge is how to evaluate the effect of supporting robots' actions on the principal robots' mission. To capture this effect, we define two novel auxiliary reward functions for supporting robots called satisfaction improvement an… ▽ More We consider the coordinated escort problem, where a decentralised team of supporting robots implicitly assist the mission of higher-value principal robots. The defining challenge is how to evaluate the effect of supporting robots' actions on the principal robots' mission. To capture this effect, we define two novel auxiliary reward functions for supporting robots called satisfaction improvement and satisfaction entropy, which computes the improvement in probability of mission success, or the uncertainty thereof. Given these reward functions, we coordinate the entire team of principal and supporting robots using decentralised cross entropy method (Dec-CEM), a new extension of CEM to multi-agent systems based on the product distribution approximation. In a simulated object avoidance scenario, our planning framework demonstrates up to two-fold improvement in task satisfaction against conventional decoupled information gathering.The significance of our results is to introduce a new family of algorithmic problems that will enable important new practical applications of heterogeneous multi-robot systems. △ Less

Submitted 2 May, 2023; originally announced May 2023.

Comments: 7 pages, 4 figures

arXiv:2001.07488 [pdf, other]

doi 10.32408/compositionality-6-1

Profunctor Optics, a Categorical Update

Authors: Bryce Clarke, Derek Elkins, Jeremy Gibbons, Fosco Loregian, Bartosz Milewski, Emily Pillmore, Mario Román

Abstract: Optics are bidirectional data accessors that capture data transformation patterns such as accessing subfields or iterating over containers. Profunctor optics are a particular choice of representation supporting modularity, meaning that we can construct accessors for complex structures by combining simpler ones. Profunctor optics have previously been studied only in an unenriched and non-mixed sett… ▽ More Optics are bidirectional data accessors that capture data transformation patterns such as accessing subfields or iterating over containers. Profunctor optics are a particular choice of representation supporting modularity, meaning that we can construct accessors for complex structures by combining simpler ones. Profunctor optics have previously been studied only in an unenriched and non-mixed setting, in which both directions of access are modelled in the same category. However, functional programming languages are arguably better described by enriched categories; and we have found that some structures in the literature are actually mixed optics, with access directions modelled in different categories. Our work generalizes a classic result by Pastro and Street on Tambara theory and uses it to describe mixed V-enriched profunctor optics and to endow them with V-category structure. We provide some original families of optics and derivations, including an elementary one for traversals. Finally, we discuss a Haskell implementation. △ Less

Submitted 20 February, 2024; v1 submitted 21 January, 2020; originally announced January 2020.

Comments: 38 pages. Final version with Compositionality metadata, does not change theorem numbering

MSC Class: 18C50 ACM Class: D.3.1

Journal ref: Compositionality, Volume 6 (2024) (February 23, 2024) compositionality:13530

arXiv:1503.01183 [pdf, other]

A General Hybrid Clustering Technique

Authors: Saeid Amiri, Bertrand Clarke, Jennifer Clarke, Hoyt A. Koepke

Abstract: Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). They key steps are to find a $K$-means clustering using $K_\ell$ cluste… ▽ More Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). They key steps are to find a $K$-means clustering using $K_\ell$ clusters where $K_\ell \gg K$ and then joins these small clusters by using single linkage clustering. The second stage stabilizes the result of stage one by reclustering via the `membership matrix' under Hamming distance to generate a dendrogram. The third stage is to cut the dendrogram to get $K^*$ clusters where $K^* \geq K$ and then prune back to $K$ to give a final clustering. A variant on our technique also gives a reasonable estimate for $K_T$, the true number of clusters. We provide a series of arguments to justify the steps in the stages of our methods and we provide numerous examples involving real and simulated data to compare our technique with other related techniques. △ Less

Submitted 5 March, 2015; v1 submitted 3 March, 2015; originally announced March 2015.

Showing 1–6 of 6 results for author: Clarke, B