-
POI-SIMEX for Conditionally Poisson Distributed Biomarkers from Tissue Histology
Authors:
Aijun Yang,
Phineas T. Hamilton,
Brad H. Nelson,
Julian J. Lum,
Mary Lesperance,
Farouk S. Nathoo
Abstract:
Covariate measurement error in regression analysis is an important issue that has been studied extensively under the classical additive and the Berkson error models. Here, we consider cases where covariates are derived from tumor tissue histology, and in particular tissue microarrays. In such settings, biomarkers are evaluated from tissue cores that are subsampled from a larger tissue section so that these biomarkers are only estimates of the true cell densities. The resulting measurement error is non-negligible but is seldom accounted for in the analysis of cancer studies involving tissue microarrays. To adjust for this type of measurement error, we assume that these discrete-valued biomarkers are conditionally Poisson distributed, based on a Poisson process model governing the spatial locations of marker-positive cells. Existing methods for addressing conditional Poisson surrogates, particularly in the absence of internal validation data, are limited. We extend the simulation extrapolation (SIMEX) algorithm to accommodate the conditional Poisson case (POI-SIMEX), where measurement errors are non-Gaussian with heteroscedastic variance. The proposed estimator is shown to be strongly consistent in a linear regression model under the assumption of a conditional Poisson distribution for the observed biomarker. Simulation studies evaluate the performance of POI-SIMEX, comparing it with the naive method and an alternative corrected likelihood approach in linear regression and survival analysis contexts. POI-SIMEX is then applied to a study of high-grade serous cancer, examining the association between survival and the presence of a triple-positive biomarker (CD3+CD8+FOXP3+ cells).
Submitted 3 November, 2024; v1 submitted 21 September, 2024;
originally announced September 2024.
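The measurement-error setting described in this abstract can be illustrated with a small simulation. The sketch below is not POI-SIMEX. Under assumed toy parameters (a Gamma-distributed true density X, a surrogate W | X ~ Poisson(X), and a linear outcome with Gaussian noise), it only shows that the naive regression of Y on W shrinks the slope by roughly Var(X) / (Var(X) + E[X]). All function names and parameter values are illustrative assumptions, not taken from the paper.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, beta0=1.0, beta1=2.0, shape=4.0, scale=2.5, seed=0):
    """Return (oracle slope, naive slope, theoretical naive slope)."""
    rng = random.Random(seed)
    x = [rng.gammavariate(shape, scale) for _ in range(n)]  # true densities
    w = [sample_poisson(v, rng) for v in x]                 # observed counts
    y = [beta0 + beta1 * v + rng.gauss(0.0, 1.0) for v in x]
    # Gamma(shape, scale): mean = shape * scale, variance = shape * scale**2.
    mean_x, var_x = shape * scale, shape * scale ** 2
    # Var(W) = Var(X) + E[X] and Cov(W, Y) = beta1 * Var(X).
    expected_naive = beta1 * var_x / (var_x + mean_x)
    return ols_slope(x, y), ols_slope(w, y), expected_naive


oracle_slope, naive_slope, expected_naive = simulate()
print(oracle_slope, naive_slope, expected_naive)
```

With these assumed values the attenuation factor is 25 / 35, so the naive slope should sit well below the true slope of 2 while the regression on the unobserved X recovers it.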
-
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Authors:
Anay Mehrotra,
Manolis Zampetakis,
Paul Kassianik,
Blaine Nelson,
Hyrum Anderson,
Yaron Singer,
Amin Karbasi
Abstract:
While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an attacker LLM to iteratively refine candidate (attack) prompts until one of the refined prompts jailbreaks the target. In addition, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks, reducing the number of queries sent to the target LLM. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. This significantly improves upon the previous state-of-the-art black-box methods for generating jailbreaks while using a smaller number of queries than them. Furthermore, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
Submitted 31 October, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Topological Interpretations of GPT-3
Authors:
Tianyi Sun,
Bradley Nelson
Abstract:
This is an experiential study investigating a consistent method for deriving the correlation between a sentence vector and the semantic meaning of a sentence. We first use three state-of-the-art word/sentence embedding methods, GPT-3, Word2Vec, and Sentence-BERT, to embed plain-text sentence strings into high-dimensional spaces. We then compute the pairwise distance between every pair of sentence vectors in an embedding space and collect these distances in a matrix. Based on each distance matrix, we compute the correlation of distances of a sentence vector with respect to the other sentence vectors in an embedding space. Then we compute the correlation of each pair of the distance matrices. We observe correlations of the same sentence in different embedding spaces and correlations of different sentences in the same embedding space. These observations are consistent with our hypothesis and take us to the next stage.
Submitted 8 August, 2023; v1 submitted 7 August, 2023;
originally announced August 2023.
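The distance-matrix comparison outlined in this abstract can be sketched roughly as follows. The vectors are toy stand-ins rather than real GPT-3, Word2Vec, or Sentence-BERT embeddings, and Euclidean distance and Pearson correlation are assumptions, since the abstract does not name the metric or the correlation measure.

```python
import math


def distance_matrix(vectors):
    """Pairwise Euclidean distances between all vectors."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]


def upper_triangle(matrix):
    """Flatten the strict upper triangle (each unordered pair once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy stand-ins for the same four sentences in two embedding spaces.
space_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
space_b = [(2.0 * x, 2.0 * y, 0.0) for x, y in space_a]  # scaled copy of space_a

dist_a = distance_matrix(space_a)
dist_b = distance_matrix(space_b)
matrix_correlation = pearson(upper_triangle(dist_a), upper_triangle(dist_b))
print(matrix_correlation)
```

Because the second toy space is a rescaled copy of the first, the two distance matrices are perfectly correlated. Real embeddings of the same sentences would generally give a value below 1.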
-
Greedy Matroid Algorithm And Computational Persistent Homology
Authors:
Tianyi Sun,
Bradley Nelson
Abstract:
An important problem in computational topology is to calculate the homology of a space from samples. In this work, we develop a statistical approach to this problem by calculating the expected rank of an induced map on homology from a sub-sample to the full space. We develop a greedy matroid algorithm for finding an optimal basis for the image of the induced map, and investigate the relationship between this algorithm and the probability of sampling vectors in the image of the induced map.
Submitted 3 August, 2023;
originally announced August 2023.
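For readers unfamiliar with greedy matroid algorithms, the generic textbook version on a linear matroid can be sketched as below. This is not the paper's algorithm for the image of an induced map on homology. It assumes vectors over GF(2) encoded as integer bitmasks, with hypothetical weights, and keeps the heaviest candidates that remain linearly independent.

```python
def highest_bit(value):
    """Index of the most significant set bit of a positive integer."""
    return value.bit_length() - 1


def greedy_basis(weighted_vectors):
    """Greedy maximum-weight basis of a linear matroid over GF(2).

    Each vector is an int bitmask; weights are hypothetical scores.
    """
    pivots = {}   # pivot bit -> reduced vector that owns that pivot
    chosen = []   # (weight, original vector) pairs kept in the basis
    for weight, vector in sorted(weighted_vectors, reverse=True):
        reduced = vector
        # Eliminate leading bits using vectors already in the basis.
        while reduced and highest_bit(reduced) in pivots:
            reduced ^= pivots[highest_bit(reduced)]
        if reduced:  # nonzero remainder: independent of the current basis
            pivots[highest_bit(reduced)] = reduced
            chosen.append((weight, vector))
    return chosen


candidates = [(5, 0b011), (4, 0b101), (3, 0b110), (1, 0b100)]
basis = greedy_basis(candidates)
print(basis)
```

Here the candidate 0b110 is skipped because it is the sum of the two heavier vectors, which is the independence test that makes the greedy choice optimal on a matroid.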
-
Getting to "rate-optimal" in ranking & selection
Authors:
Harun Avci,
Barry L. Nelson,
Andreas Wächter
Abstract:
In their 2004 seminal paper, Glynn and Juneja formally and precisely established the rate-optimal, probability-of-incorrect-selection, replication allocation scheme for selecting the best of k simulated systems. In the case of independent, normally distributed outputs, this allocation has a simple form that depends in an intuitively appealing way on the true means and variances. Of course, the means and (typically) variances are unknown, but the rate-optimal allocation provides a target for implementable, dynamic, data-driven policies to achieve. In this paper, we compare the empirical behavior of four related replication-allocation policies: mCEI from Chen and Ryzhov and our new gCEI policy that both converge to the Glynn and Juneja allocation; AOMAP from Peng and Fu that converges to the OCBA optimal allocation; and TTTS from Russo that targets the rate of convergence of the posterior probability of incorrect selection. We find that these policies have distinctly different behavior in some settings.
Submitted 4 February, 2023;
originally announced February 2023.
-
What is an equivariant neural network?
Authors:
Lek-Heng Lim,
Bradley J. Nelson
Abstract:
We explain equivariant neural networks, a notion underlying breakthroughs in machine learning from deep convolutional neural networks for computer vision to AlphaFold 2 for protein structure prediction, without assuming knowledge of equivariance or neural networks. The basic mathematical ideas are simple but are often obscured by engineering complications that come with practical realizations. We extract and focus on the mathematical aspects, and limit ourselves to a cursory treatment of the engineering issues at the end.
Submitted 16 November, 2022; v1 submitted 15 May, 2022;
originally announced May 2022.
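As a minimal illustration of the equivariance idea behind convolutional layers, the sketch below checks that a circular 1-D convolution commutes with cyclic shifts: shifting the input and then convolving gives the same result as convolving and then shifting. The signal, kernel, and function names are illustrative assumptions and are not drawn from the paper.

```python
def circular_conv(signal, kernel):
    """Circular cross-correlation of a 1-D signal with a short kernel."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, steps):
    """Cyclically shift a signal to the right by `steps` positions."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]


signal = [1.0, 4.0, 2.0, 0.0, 3.0]
kernel = [0.5, -1.0, 2.0]

shift_then_conv = circular_conv(shift(signal, 2), kernel)
conv_then_shift = shift(circular_conv(signal, kernel), 2)
print(shift_then_conv == conv_then_shift)
```

The equality holds for any shift because the layer applies the same kernel at every position. In this example the group is the cyclic group of shifts acting on both input and output.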
-
Stochastic Simulation Uncertainty Analysis to Accelerate Flexible Biomanufacturing Process Development
Authors:
Wei Xie,
Russell R. Barton,
Barry L. Nelson,
Keqi Wang
Abstract:
Motivated by critical challenges and needs from biopharmaceuticals manufacturing, we propose a general metamodel-assisted stochastic simulation uncertainty analysis framework to accelerate the development of a simulation model with modular design for flexible production processes. There are often very limited process observations. Thus, there exist both simulation and model uncertainties in the system performance estimates. In biopharmaceutical manufacturing, model uncertainty often dominates. The proposed framework can produce a confidence interval that accounts for simulation and model uncertainties by using a metamodel-assisted bootstrapping approach. Furthermore, a variance decomposition is utilized to estimate the relative contributions from each source of model uncertainty, as well as simulation uncertainty. This information can be used to improve the system mean performance estimation. Asymptotic analysis provides theoretical support for our approach, while the empirical study demonstrates that it has good finite-sample performance.
Submitted 3 September, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
Statistical Uncertainty Analysis for Stochastic Simulation
Authors:
Wei Xie,
Barry L. Nelson,
Russell R. Barton
Abstract:
When we use simulation to evaluate the performance of a stochastic system, the simulation often contains input distributions estimated from real-world data; therefore, there is both simulation and input uncertainty in the performance estimates. Ignoring either source of uncertainty underestimates the overall statistical error. Simulation uncertainty can be reduced by additional computation (e.g., more replications). Input uncertainty can be reduced by collecting more real-world data, when feasible. This paper proposes an approach to quantify overall statistical uncertainty when the simulation is driven by independent parametric input distributions; specifically, we produce a confidence interval that accounts for both simulation and input uncertainty by using a metamodel-assisted bootstrapping approach. The input uncertainty is measured via bootstrapping, an equation-based stochastic kriging metamodel propagates the input uncertainty to the output mean, and both simulation and metamodel uncertainty are derived using properties of the metamodel. A variance decomposition is proposed to estimate the relative contribution of input to overall uncertainty; this information indicates whether the overall uncertainty can be significantly reduced through additional simulation alone. Asymptotic analysis provides theoretical support for our approach, while an empirical study demonstrates that it has good finite-sample performance.
Submitted 9 November, 2020;
originally announced November 2020.
-
A Topology Layer for Machine Learning
Authors:
Rickard Brüel-Gabrielsson,
Bradley J. Nelson,
Anjan Dwaraknath,
Primoz Skraba,
Leonidas J. Guibas,
Gunnar Carlsson
Abstract:
Topology applied to real world data using persistent homology has started to find applications within machine learning, including deep learning. We present a differentiable topology layer that computes persistent homology based on level set filtrations and edge-based filtrations. We present three novel applications: the topological layer can (i) regularize data reconstruction or the weights of machine learning models, (ii) construct a loss on the output of a deep generative network to incorporate topological priors, and (iii) perform topological adversarial attacks on deep networks trained with persistence features. The code (www.github.com/bruel-gabrielsson/TopologyLayer) is publicly available and we hope its availability will facilitate the use of persistent homology in deep learning and other gradient based applications.
Submitted 24 April, 2020; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Sparse canonical correlation analysis
Authors:
Xiaotong Suo,
Victor Minden,
Bradley Nelson,
Robert Tibshirani,
Michael Saunders
Abstract:
Canonical correlation analysis was proposed by Hotelling [6]; it measures the linear relationship between two multidimensional variables. In the high-dimensional setting, classical canonical correlation analysis breaks down. We propose a sparse canonical correlation analysis by adding l1 constraints on the canonical vectors and show how to solve it efficiently using the linearized alternating direction method of multipliers (ADMM) and using TFOCS as a black box. We illustrate this idea on simulated data.
Submitted 2 June, 2017; v1 submitted 30 May, 2017;
originally announced May 2017.
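A building block that l1-regularized ADMM-style solvers commonly rely on is the soft-thresholding operator, the proximal operator of the l1 norm. The sketch below shows only that operator, under the assumption of a penalized (rather than constrained) l1 term. It is not the paper's sparse CCA solver, and the threshold value is arbitrary.

```python
def soft_threshold(vector, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    result = []
    for value in vector:
        if value > tau:
            result.append(value - tau)   # shrink positive entries toward zero
        elif value < -tau:
            result.append(value + tau)   # shrink negative entries toward zero
        else:
            result.append(0.0)           # small entries are set exactly to zero
    return result


shrunk = soft_threshold([3.0, -0.5, 1.0, -2.0], tau=1.0)
print(shrunk)
```

Entries whose magnitude does not exceed the threshold become exactly zero, which is how an l1 term produces sparse canonical vectors.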
-
Regularized Estimation of Piecewise Constant Gaussian Graphical Models: The Group-Fused Graphical Lasso
Authors:
Alexander J. Gibberd,
James D. B. Nelson
Abstract:
The time-evolving precision matrix of a piecewise-constant Gaussian graphical model encodes the dynamic conditional dependency structure of a multivariate time-series. Traditionally, graphical models are estimated under the assumption that data are drawn identically from a generating distribution. Introducing sparsity and sparse-difference inducing priors, we relax these assumptions and propose a novel regularized M-estimator to jointly estimate both the graph and changepoint structure. The resulting estimator can therefore favor sparse dependency structures and/or smoothly evolving graph structures, as required. Moreover, our approach extends current methods to allow estimation of changepoints that are grouped across multiple dependencies in a system. An efficient algorithm for estimating structure is proposed. We study the empirical recovery properties in a synthetic setting. The qualitative effect of grouped changepoint estimation is then demonstrated by applying the method to two real-world data-sets.
Submitted 31 October, 2017; v1 submitted 18 December, 2015;
originally announced December 2015.
-
Bayesian Differential Privacy through Posterior Sampling
Authors:
Christos Dimitrakakis,
Blaine Nelson,
Zuhe Zhang,
Aikaterini Mitrokotsa,
Benjamin Rubinstein
Abstract:
Differential privacy formalises privacy-preserving mechanisms that provide access to a database. We pose the question of whether Bayesian inference itself can be used directly to provide private access to data, with no modification. The answer is affirmative: under certain conditions on the prior, sampling from the posterior distribution can be used to achieve a desired level of privacy and utility. To do so, we generalise differential privacy to arbitrary dataset metrics, outcome spaces and distribution families. This allows us to also deal with non-i.i.d. or non-tabular datasets. We prove bounds on the sensitivity of the posterior to the data, which gives a measure of robustness. We also show how to use posterior sampling to provide differentially private responses to queries, within a decision-theoretic framework. Finally, we provide bounds on the utility and on the distinguishability of datasets. The latter are complemented by a novel use of Le Cam's method to obtain lower bounds. All our general results hold for arbitrary database metrics, including those for the common definition of differential privacy. For specific choices of the metric, we give a number of examples satisfying our assumptions.
Submitted 23 December, 2016; v1 submitted 5 June, 2013;
originally announced June 2013.
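The idea of answering a query with a posterior draw can be sketched on a toy conjugate model. The example assumes Bernoulli records with a Beta prior and releases a single posterior sample as the response. The prior parameters and function names are hypothetical, and the sketch makes no claim about the privacy level achieved, which according to the abstract depends on conditions on the prior.

```python
import random


def posterior_sample(data, prior_a=1.0, prior_b=1.0, rng=None):
    """Answer a query about a Bernoulli rate with one posterior draw.

    data: list of 0/1 records; prior_a, prior_b: Beta prior parameters
    (illustrative defaults, not values from the paper).
    """
    rng = rng or random.Random()
    successes = sum(data)
    failures = len(data) - successes
    # Beta prior + Bernoulli likelihood gives a Beta posterior.
    return rng.betavariate(prior_a + successes, prior_b + failures)


dataset = [1, 0, 1, 1, 0, 1, 1, 1]
neighbour = dataset[:-1] + [0]  # differs from dataset in a single record

rng = random.Random(0)
response = posterior_sample(dataset, rng=rng)
neighbour_response = posterior_sample(neighbour, rng=rng)
print(response, neighbour_response)
```

The two datasets differ in one record, so their posteriors are close. How close, and what that implies for privacy and utility, is the kind of question the paper's sensitivity bounds address.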
-
Poisoning Attacks against Support Vector Machines
Authors:
Battista Biggio,
Blaine Nelson,
Pavel Laskov
Abstract:
We investigate a family of poisoning attacks against Support Vector Machines (SVM). Such attacks inject specially crafted training data that increases the SVM's test error. Central to the motivation for these attacks is the fact that most learning algorithms assume that their training data comes from a natural or well-behaved distribution. However, this assumption does not generally hold in security-sensitive settings. As we demonstrate, an intelligent adversary can, to some extent, predict the change of the SVM's decision function due to malicious input and use this ability to construct malicious data. The proposed attack uses a gradient ascent strategy in which the gradient is computed based on properties of the SVM's optimal solution. This method can be kernelized and enables the attack to be constructed in the input space even for non-linear kernels. We experimentally demonstrate that our gradient ascent procedure reliably identifies good local maxima of the non-convex validation error surface, which significantly increases the classifier's test error.
Submitted 25 March, 2013; v1 submitted 27 June, 2012;
originally announced June 2012.