-
Weighted Euclidean Distance Matrices over Mixed Continuous and Categorical Inputs for Gaussian Process Models
Authors:
Mingyu Pu,
Songhao Wang,
Haowei Wang,
Szu Hui Ng
Abstract:
Gaussian Process (GP) models are widely used as surrogate models in scientific and engineering fields. However, standard GP models are limited to continuous variables because it is difficult to establish correlation structures for categorical variables. To overcome this limitation, we introduce WEighted Euclidean distance matrices Gaussian Process (WEGP). WEGP constructs the kernel function for each categorical input by estimating the Euclidean distance matrix (EDM) among all categorical choices of this input. The EDM is represented as a linear combination of several predefined base EDMs, each scaled by a positive weight. The weights, along with other kernel hyperparameters, are inferred using a fully Bayesian framework. We analyze the predictive performance of WEGP theoretically. Numerical experiments validate the accuracy of our GP model, and by integrating WEGP into Bayesian Optimization (BO), we achieve superior performance on both synthetic and real-world optimization problems.
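The idea of writing a categorical input's EDM as a positively weighted sum of base EDMs can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the base EDMs are "cut" matrices from one-dimensional 0/1 embeddings (`cut_edm`), the weights are fixed by hand rather than inferred in a Bayesian framework, and the kernel form `exp(-D)` is one common choice that stays positive semidefinite whenever `D` holds squared Euclidean distances.

```python
import numpy as np

def cut_edm(n_levels, subset):
    """Squared-distance matrix of a 1-D embedding: levels in `subset` sit at 1, the rest at 0."""
    z = np.zeros(n_levels)
    z[list(subset)] = 1.0
    return (z[:, None] - z[None, :]) ** 2

def weighted_edm(base_edms, weights):
    """Linear combination of base EDMs with strictly positive weights."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights <= 0):
        raise ValueError("weights must be positive")
    # Contract the weight vector against the stacked (n_bases, k, k) array.
    return np.tensordot(weights, np.stack(base_edms), axes=1)

def categorical_kernel(edm):
    """Assumed kernel form: Gaussian kernel on the implied embedding."""
    return np.exp(-edm)

# Three categorical levels, one hypothetical base EDM per level.
bases = [cut_edm(3, {0}), cut_edm(3, {1}), cut_edm(3, {2})]
D = weighted_edm(bases, [0.5, 1.0, 2.0])  # hand-picked weights, not inferred
K = categorical_kernel(D)
```

A positively weighted sum of EDMs is again an EDM, which is why the resulting `K` can serve as a valid correlation matrix among the categorical levels in this sketch.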
Submitted 4 March, 2025;
originally announced March 2025.
-
An Efficient Self-optimized Sampling Method for Rare Events in Nonequilibrium Systems
Authors:
Huijun Jiang,
Mingfeng Pu,
Zhonghuai Hou
Abstract:
Rare events such as nucleation processes are of ubiquitous importance in real systems. The most popular method for nonequilibrium systems, forward flux sampling (FFS), samples rare events by using interfaces to partition the whole transition process into a sequence of steps along an order parameter connecting the initial and final states. FFS usually suffers from two main difficulties: low computational efficiency due to poorly placed interfaces, and even inapplicability when trajectories become trapped in unknown intermediate metastable states. In the present work, we propose an approach to overcome these difficulties by self-adaptively locating the interfaces on the fly in an optimized manner. Unlike conventional FFS, which sets the interfaces at equal distances along the order parameter, our approach places the interfaces at equal transition probability, which is shown to satisfy the optimization condition. This is done by first running long local trajectories starting from the current interface $\lambda_i$ to get the conditional probability distribution $P_c$, and then determining $\lambda_{i+1}$ by setting $P_c$ equal to a given value $p_0$. With these optimized interfaces, FFS can be run in a much more efficient way. In addition, our approach can conveniently find the intermediate metastable states by monitoring some special long trajectories that neither end at the initial state nor reach the next interface, the number of which will increase sharply from zero if such metastable states are encountered. We apply our approach to a model two-state system and a two-dimensional lattice gas Ising model. Our approach is shown to be much more efficient than the conventional FFS method without losing accuracy, and it also reproduces well the two-step nucleation scenario of the Ising model, with easy identification of the intermediate metastable state.
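The interface-placement step described above (estimate the conditional probability of reaching a given order-parameter value from the current interface, then put the next interface where that probability equals $p_0$) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the dynamics are a hypothetical one-dimensional biased random walk, a trial trajectory is stopped when it returns to the initial state `lam_A`, and the next interface is taken as an empirical quantile of the maximum order parameter reached by each trial.

```python
import numpy as np

def next_interface(max_reached, p0):
    """Place the next interface where a fraction p0 of trial trajectories reach it."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    # P_c(lam) is the fraction of trials whose maximum is >= lam,
    # so P_c = p0 at the (1 - p0) empirical quantile of the maxima.
    return float(np.quantile(np.asarray(max_reached, dtype=float), 1.0 - p0))

def run_trial(rng, lam_start, lam_A, drift=-0.05, step=0.2, max_steps=10_000):
    """Toy trajectory (assumed dynamics): biased random walk started on the current interface."""
    x = lam_start
    x_max = x
    for _ in range(max_steps):
        x += drift + step * rng.standard_normal()
        x_max = max(x_max, x)
        if x <= lam_A:  # returned to the initial state
            break
    return x_max

rng = np.random.default_rng(0)
lam_i, lam_A = 1.0, 0.0
maxima = np.array([run_trial(rng, lam_i, lam_A) for _ in range(2000)])
lam_next = next_interface(maxima, p0=0.1)
frac_reaching = float(np.mean(maxima >= lam_next))
```

In this sketch, trials that hit `max_steps` without returning to `lam_A` or advancing would correspond to the "special long trajectories" the abstract monitors to detect intermediate metastable states; counting them is left out for brevity.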
Submitted 8 August, 2013;
originally announced August 2013.
-
Statistical tests for the intersection of independent lists of genes: Sensitivity, FDR, and type I error control
Authors:
Loki Natarajan,
Minya Pu,
Karen Messer
Abstract:
Public data repositories have enabled researchers to compare results across multiple genomic studies in order to replicate findings. A common approach is to first rank genes according to a hypothesis of interest within each study. Then, lists of the top-ranked genes within each study are compared across studies. Genes recaptured as highly ranked (usually above some threshold) in multiple studies are considered to be significant. However, this comparison strategy often remains informal, in that type I error and false discovery rate (FDR) are usually uncontrolled. In this paper, we formalize an inferential strategy for this kind of list-intersection discovery test. We show how to compute a $p$-value associated with a "recaptured" set of genes, using a closed-form Poisson approximation to the distribution of the size of the recaptured set. We investigate operating characteristics of the test as a function of the total number of studies considered, the rank threshold within each study, and the number of studies within which a gene must be recaptured to be declared significant. We investigate the trade-off between FDR control and expected sensitivity (the expected proportion of true-positive genes identified as significant). We give practical guidance on how to design a bioinformatic list-intersection study with maximal expected sensitivity and prespecified control of type I error (at the set level) and false discovery rate (at the gene level). We show how the optimal choice of parameters may depend on the particular alternative hypothesis that holds. We illustrate our methods using prostate cancer gene-expression datasets from the curated Oncomine database, and discuss the effects of dependence between genes on the test.
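A Poisson-approximation $p$-value for the size of a recaptured gene set can be sketched as below. The null model here is an assumption made for illustration and may differ from the paper's exact formulation: each of `m` independent studies lists its top `r` of `n_genes` genes uniformly at random, a gene counts as recaptured if it appears in at least `k` lists, and the number of recaptured genes is treated as Poisson with mean `n_genes` times the per-gene recapture probability. All function names are hypothetical.

```python
import math

def recapture_prob(m, r, n_genes, k):
    """P(a given gene lands in the top r of at least k of m lists) under the assumed null."""
    q = r / n_genes  # chance of being in one study's top-r list
    return sum(math.comb(m, j) * q**j * (1.0 - q) ** (m - j) for j in range(k, m + 1))

def poisson_tail(lam, x):
    """P(X >= x) for X ~ Poisson(lam), by direct summation of the lower tail."""
    cdf_below = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(x))
    return max(0.0, 1.0 - cdf_below)

def list_intersection_pvalue(observed, m, r, n_genes, k):
    """Approximate p-value for observing at least `observed` recaptured genes."""
    lam = n_genes * recapture_prob(m, r, n_genes, k)
    return poisson_tail(lam, observed)

# Hypothetical design: 3 studies, top 100 of 10,000 genes each,
# a gene must appear in at least 2 lists, and 10 genes were recaptured.
expected_null = 10_000 * recapture_prob(3, 100, 10_000, 2)
p_value = list_intersection_pvalue(10, m=3, r=100, n_genes=10_000, k=2)
```

Varying `m`, `r`, and `k` in this sketch shows how the design parameters named in the abstract change the null expectation, and therefore how many recaptured genes are needed for a small set-level $p$-value.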
Submitted 28 June, 2012;
originally announced June 2012.