-
Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform
Authors:
Jay Roberts,
Kyle Mylonakis,
Sidhartha Roy,
Kaan Kale
Abstract:
The high cost of ownership of AI compute infrastructure and the challenges of robustly serving large language models (LLMs) have led to a surge in managed Model-as-a-service deployments. Even when enterprises choose on-premises deployments, the compute infrastructure is typically shared across many teams in order to maximize the return on investment. In both scenarios the deployed models operate only on plaintext data, and so enterprise data owners must allow their data to appear in plaintext on a shared or multi-tenant compute infrastructure. This results in data owners with private or sensitive data being hesitant or restricted in what data they use with these types of deployments. In this work we introduce the Stained Glass Transform, a learned, stochastic, and sequence-dependent transformation of the word embeddings of an LLM which provides information-theoretic privacy to the input of the LLM while preserving the utility of the model. We theoretically connect a particular class of Stained Glass Transforms to the theory of mutual information of Gaussian Mixture Models. We then calculate a posteriori privacy estimates, based on mutual information, and verify the privacy and utility of instances of transformed embeddings through token-level metrics of privacy and standard LLM performance benchmarks.
Submitted 11 June, 2025;
originally announced June 2025.
-
BeamClean: Language Aware Embedding Reconstruction
Authors:
Kaan Kale,
Kyle Mylonakis,
Jay Roberts,
Sidhartha Roy
Abstract:
In this work, we consider an inversion attack on the obfuscated input embeddings sent to a language model on a server, where the adversary has no access to the language model or the obfuscation mechanism and sees only the obfuscated embeddings along with the model's embedding table. We propose BeamClean, an inversion attack that jointly estimates the noise parameters and decodes token sequences by integrating a language-model prior. Against Laplacian and Gaussian obfuscation mechanisms, BeamClean always surpasses naive distance-based attacks. This work highlights the necessity for and robustness of more advanced learned, input-dependent methods.
Submitted 19 May, 2025;
originally announced May 2025.
-
Deep Learning Seismic Substructure Detection using the Frozen Gaussian Approximation
Authors:
James C. Hateley,
Jay Roberts,
Kyle Mylonakis,
Xu Yang
Abstract:
We propose a deep learning algorithm for seismic interface and pocket detection with neural networks trained by synthetic high-frequency displacement data efficiently generated by the frozen Gaussian approximation (FGA). In seismic imaging, high-frequency data is advantageous since it can provide high resolution of substructures. However, generation of sufficient synthetic high-frequency data sets for training neural networks is computationally challenging. This bottleneck is overcome by a highly scalable computational platform built upon the FGA, which comes from the semiclassical theory and approximates the wavefields by a sum of fixed-width (frozen) Gaussian wave packets. Data is generated from a forward simulation of the elastic wave equation using the FGA. This data contains accurate traveltime information (from the ray path) but not exact amplitude information (with asymptotic errors not shrinking to zero even at extremely fine numerical resolution). Using this data we build convolutional neural network models using an open source API, GeoSeg, developed using Keras and TensorFlow. On a simple model, networks, despite only being trained on FGA data, can detect an interface with a high success rate from displacement data generated by the spectral element method. Benchmark tests are done for P-waves (acoustic) and P- and S-waves (elastic) generated using the FGA and a spectral element method. Further, results with high accuracy are shown for more complicated geometries, including a three-layered model and a 2D-pocket model where the neural networks are trained on both clean and noisy data.
Submitted 5 November, 2019; v1 submitted 15 October, 2018;
originally announced October 2018.
-
Rapid Near-Neighbor Interaction of High-dimensional Data via Hierarchical Clustering
Authors:
Nikos Pitsianis,
Dimitris Floros,
Alexandros-Stavros Iliopoulos,
Kostas Mylonakis,
Nikos Sismanis,
Xiaobai Sun
Abstract:
Calculation of near-neighbor interactions among high-dimensional, irregularly distributed data points is a fundamental task in many graph-based or kernel-based machine learning algorithms and applications. Such calculations, involving large, sparse interaction matrices, expose the limitation of conventional data-and-computation reordering techniques for improving space and time locality on modern computer memory hierarchies. We introduce a novel method for obtaining a matrix permutation that renders a desirable sparsity profile. The method is distinguished by the guiding principle to obtain a profile that is block-sparse with dense blocks. Our profile model and measure capture the essential properties affecting space and time locality, and permit variation in sparsity profile without imposing a restriction to a fixed pattern. The second distinction lies in an efficient algorithm for obtaining a desirable profile by exploring and exploiting the multi-scale cluster structure hidden in but intrinsic to the data. The algorithm accomplishes its task with key components for lower-dimensional embedding with data-specific principal feature axes, hierarchical data clustering, multi-level matrix compression storage, and multi-level interaction computations. We provide experimental results from case studies with two important data analysis algorithms. The resulting performance is remarkably comparable to the BLAS performance for the best-case interaction governed by a regularly banded matrix with the same sparsity.
Submitted 11 September, 2017;
originally announced September 2017.