Search | arXiv e-print repository

arXiv:2506.22313

Manifold-Constrained Gaussian Processes for Inference of Mixed-effects Ordinary Differential Equations with Application to Pharmacokinetics

Authors: Yuxuan Zhao, Samuel W. K. Wong

Abstract: Pharmacokinetic modeling using ordinary differential equations (ODEs) has an important role in dose optimization studies, where dosing must balance sustained therapeutic efficacy with the risk of adverse side effects. Such ODE models characterize drug plasma concentration over time and allow pharmacokinetic parameters to be inferred, such as drug absorption and elimination rates. For time-course studies involving treatment groups with multiple subjects, mixed-effects ODE models are commonly used. However, existing methods tend to lack uncertainty quantification at the subject level, both for key measures such as peak or trough concentration and for predictions of drug concentration. To address such limitations, we propose an extension of manifold-constrained Gaussian processes for inference of general mixed-effects ODE models within a Bayesian statistical framework. We evaluate our method on simulated examples, demonstrating its ability to provide fast and accurate inference for parameters and trajectories using nested optimization. To illustrate the practical efficacy of the proposed method, we provide a real data analysis of a pharmacokinetic model used for an HIV combination therapy study.

Submitted 27 June, 2025; originally announced June 2025.

Comments: 34 pages, 4 figures

arXiv:2506.20048

A Principled Path to Fitted Distributional Evaluation

Authors: Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond Ka Wai Wong

Abstract: In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.09722

Fully Bayesian Sequential Design for Mean Response Surface Prediction of Heteroscedastic Stochastic Simulations

Authors: Yuying Huang, Samuel W. K. Wong

Abstract: We present a fully Bayesian sequential strategy for predicting the mean response surface of heteroscedastic stochastic simulation functions. Leveraging dual Gaussian processes as the surrogate model and a criterion based on empirical expected integrated mean-square prediction error, our approach sequentially selects informative design points while fully accounting for parameter uncertainty. Sequential importance sampling is employed to efficiently update the posterior distribution of the parameters. Our strategy is tailored for expensive simulation functions, where achieving robust predictive accuracy under a limited budget is critical. We illustrate its potential advantages compared to existing approaches through synthetic examples. We then implement the proposed strategy on a real motivating application in seismic design of wood-frame podium buildings.

Submitted 11 June, 2025; originally announced June 2025.

Comments: 37 pages, 8 figures

arXiv:2504.13467

Efficient Estimation under Multiple Missing Patterns via Balancing Weights

Authors: Jianing Dong, Raymond K. W. Wong, Kwun Chuen Gary Chan

Abstract: As one of the most commonly seen data challenges, missing data, in particular, multiple, non-monotone missing patterns, complicates estimation and inference because missingness mechanisms are often not missing at random, and conventional methods cannot be applied. Pattern graphs have recently been proposed as a tool to systematically relate various observed patterns in the sample. We extend their scope to the estimation of parameters defined by moment equations, including common regression models, via solving weighted estimating equations with weights constructed using a sequential balancing approach. These novel weights are carefully crafted to address the instability issue of the straightforward approach based on local balancing. We derive the efficiency bound for the model parameters and show that our proposed method, albeit relatively simple, is asymptotically efficient. Simulation results demonstrate the superior performance of the proposed method, and real-data applications illustrate how the results are robust to the choice of identification assumptions.

Submitted 18 April, 2025; originally announced April 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2402.08873

arXiv:2503.00002

Failure of Optimal Design Theory? A Case Study in Toxicology Using Sequential Robust Optimal Design Framework

Authors: Elvis Han Cui, Michael Collins, Jessica Munson, Weng Kee Wong

Abstract: This paper presents a quasi-sequential optimal design framework for toxicology experiments, specifically applied to sea urchin embryos. The authors propose a novel approach combining robust optimal design with adaptive, stage-based testing to improve efficiency in toxicological studies, particularly where traditional uniform designs fall short. The methodology uses statistical models to refine dose levels across experimental phases, aiming for increased precision while reducing costs and complexity. Key components include selecting an initial design, iterative dose optimization based on preliminary results, and assessing various model fits to ensure robust, data-driven adjustments. Through case studies, we demonstrate improved statistical efficiency and adaptability in toxicology, with potential applications in other experimental domains.

Submitted 10 February, 2025; originally announced March 2025.

arXiv:2411.05797

Metaheuristics is All You Need

Authors: Eliuvish Cuicizion, Haowen Xu, Weng Kee Wong

Abstract: Optimization plays an important role in tackling public health problems. Animal instincts can be used effectively to solve complex public health management issues by providing optimal or approximately optimal solutions to complicated optimization problems common in public health. The BAT algorithm is an exemplary member of a class of nature-inspired metaheuristic optimization algorithms and is designed to outperform existing metaheuristic algorithms in terms of efficiency and accuracy. Its inspiration comes from the foraging behavior of a group of microbats that use echolocation to find their target in the surrounding environment. In recent years, the BAT algorithm has been extensively used by researchers in the area of optimization, and many variants of the BAT algorithm have been developed to improve its performance and extend its application to diverse disciplines. This paper first reviews the basic BAT algorithm and its variants, including their applications in various fields. As a specific application, we apply the BAT algorithm to a biostatistical estimation problem and show it has some clear advantages over existing algorithms.

Submitted 21 March, 2025; v1 submitted 25 October, 2024; originally announced November 2024.

Comments: 25 pages, many figures

arXiv:2410.11482

Scalable likelihood-based estimation and variable selection for the Cox model with incomplete covariates

Authors: Ngok Sang Kwok, Kin Yau Wong

Abstract: Regression analysis with missing data is a long-standing and challenging problem, particularly when there are many missing variables with arbitrary missing patterns. Likelihood-based methods, although theoretically appealing, are often computationally inefficient or even infeasible when dealing with a large number of missing variables. In this paper, we consider the Cox regression model with incomplete covariates that are missing at random. We develop an expectation-maximization (EM) algorithm for nonparametric maximum likelihood estimation, employing a transformation technique in the E-step so that it involves only a one-dimensional integration. This innovation makes our methods scalable with respect to the dimension of the missing variables. In addition, for variable selection, we extend the proposed EM algorithm to accommodate a LASSO penalty in the likelihood. We demonstrate the feasibility and advantages of the proposed methods over existing methods by large-scale simulation studies and apply the proposed methods to a cancer genomic study.

Submitted 15 October, 2024; originally announced October 2024.

Comments: 15 pages, 2 figures, 7 tables

arXiv:2409.03572

Extrinsic Principal Component Analysis

Authors: Ka Chun Wong, Vic Patrangenaru, Robert L. Paige, Mihaela Pricop Jeckstadt

Abstract: One develops a fast computational methodology for principal component analysis on manifolds. Instead of estimating intrinsic principal components on an object space with a Riemannian structure, one embeds the object space in a numerical space, and the resulting chord distance is used. This method helps us analyze high-dimensional, theoretically even infinite-dimensional, data from a new perspective. We define the extrinsic principal sub-manifolds of a random object on a Hilbert manifold embedded in a Hilbert space, and the sample counterparts. The resulting extrinsic principal components are useful for data dimension reduction. For application, one retains a very small number of such extrinsic principal components for a shape of contour data sample, extracted from imaging data.

Submitted 3 October, 2024; v1 submitted 5 September, 2024; originally announced September 2024.

arXiv:2406.15170

Inference for Delay Differential Equations Using Manifold-Constrained Gaussian Processes

Authors: Yuxuan Zhao, Samuel W. K. Wong

Abstract: Dynamic systems described by differential equations often involve feedback among system components. When there are time delays for components to sense and respond to feedback, delay differential equation (DDE) models are commonly used. This paper considers the problem of inferring unknown system parameters, including the time delays, from noisy and sparse experimental data observed from the system. We propose an extension of manifold-constrained Gaussian processes to conduct parameter inference for DDEs, whereas the time delay parameters have posed a challenge for existing methods that bypass numerical solvers. Our method uses a Bayesian framework to impose a Gaussian process model over the system trajectory, conditioned on the manifold constraint that satisfies the DDEs. For efficient computation, a linear interpolation scheme is developed to approximate the values of the time-delayed system outputs, along with corresponding theoretical error bounds on the approximated derivatives. Two simulation examples, based on Hutchinson's equation and the lac operon system, together with a real-world application using Ontario COVID-19 data, are used to illustrate the efficacy of our method.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 42 pages, 8 figures

arXiv:2405.12386

Particle swarm optimization with Applications to Maximum Likelihood Estimation and Penalized Negative Binomial Regression

Authors: Sisi Shao, Junhyung Park, Weng Kee Wong

Abstract: General purpose optimization routines such as nlminb, optim (R) or nlmixed (SAS) are frequently used to estimate model parameters in nonstandard distributions. This paper presents Particle Swarm Optimization (PSO) as an alternative to many of the current algorithms used in statistics. We find that PSO can not only reproduce the same results as the above routines, it can also produce results that are closer to optimal, or produce results when the others cannot converge. In the latter case, it can also identify the source of the problem or problems. We highlight the advantages of PSO using four examples, where: (1) some parameters in a generalized distribution are unidentified using PSO when it is not apparent or computationally manifested using routines in R or SAS; (2) PSO can produce estimation results for the log-binomial regressions when current routines may not; (3) PSO provides flexibility in the link function for binomial regression with LASSO penalty, which is unsupported by standard packages like GLM and GENMOD in Stata and SAS, respectively; and (4) PSO provides superior MLE estimates for an EE-IW distribution compared with those from the traditional statistical methods that rely on moments.

Submitted 20 May, 2024; originally announced May 2024.
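
Illustrative note (not from the paper): the abstract above describes using PSO in place of routines such as optim for maximum likelihood estimation. The following is a minimal sketch of a textbook global-best PSO applied to a toy exponential-rate MLE. The function name `pso_minimize`, the inertia and acceleration constants, the simulated data, and the log-rate parameterization are all assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def pso_minimize(f, lower, upper, n_particles=30, n_iter=200, seed=0):
    """Textbook global-best PSO for box-constrained minimization (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Start particles uniformly in the box with zero velocity.
    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                              # best position seen by each particle
    pbest_val = np.array([f(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()      # best position seen by the swarm
    w, c1, c2 = 0.7, 1.5, 1.5                       # assumed inertia and acceleration weights
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Pull each particle toward its own best and the swarm best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lower, upper)
        val = np.array([f(p) for p in pos])
        improved = val < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = val[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, float(pbest_val.min())

# Toy problem (assumed): MLE of an exponential rate, optimized over theta = log(rate).
data = np.random.default_rng(1).exponential(scale=2.0, size=200)

def neg_loglik(theta):
    rate = np.exp(theta[0])
    return -(data.size * np.log(rate) - rate * data.sum())

theta_hat, _ = pso_minimize(neg_loglik, [-5.0], [5.0])
rate_hat = float(np.exp(theta_hat[0]))
closed_form = 1.0 / data.mean()   # known closed-form MLE, used only as a check
```

In this toy case the closed-form MLE is available, so `rate_hat` can be compared against `closed_form`; the problems named in the abstract are ones where no such check exists.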

arXiv:2402.08873

Balancing Weights for Non-monotone Missing Data

Authors: Jianing Dong, Raymond K. W. Wong, Kwun Chuen Gary Chan

Abstract: Balancing weights have been widely applied to single or monotone missingness due to empirical advantages over likelihood-based methods and inverse probability weighting approaches. This paper considers non-monotone missing data under the complete-case missing variable condition (CCMV), a case of missing not at random (MNAR). Using relationships between each missing pattern and the complete-case subsample, we construct a weighted estimator, where the weight is a sum of ratios of the conditional probability of observing a particular missing pattern versus that of observing the complete case, given the variables observed in the corresponding missing pattern. However, plug-in estimators of the propensity odds can be unbounded and lead to unstable estimation. Using further relations between propensity odds and balancing of moments across response patterns, we employ tailored loss functions, each encouraging empirical balance across patterns to estimate propensity odds flexibly using a functional basis expansion. We propose two penalizations to control propensity odds model smoothness and empirical imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions. Asymptotic normality and efficiency are developed. Simulation results show the superior performance of the proposed method.

Submitted 12 December, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.06058

Mathematical programming tools for randomization purposes in small two-arm clinical trials: A case study with real data

Authors: Alan R. Vazquez, Weng Kee Wong

Abstract: Modern randomization methods in clinical trials are invariably adaptive, meaning that the assignment of the next subject to a treatment group uses the accumulated information in the trial. Some of the recent adaptive randomization methods use mathematical programming to construct attractive clinical trials that balance the group features, such as their sizes and covariate distributions of their subjects. We review some of these methods and compare their performance with common covariate-adaptive randomization methods for small clinical trials. We introduce an energy distance measure that compares the discrepancy between the two groups using the joint distribution of the subjects' covariates. This metric is more appealing than evaluating the discrepancy between the groups using their marginal covariate distributions. Using numerical experiments, we demonstrate the advantages of the mathematical programming methods under the new measure. In the supplementary material, we provide R code to reproduce our study results and facilitate comparisons of different randomization procedures.

Submitted 8 February, 2024; originally announced February 2024.

Comments: 36 pages, 12 figures
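
Illustrative note (not from the paper): the abstract above compares two treatment groups through the joint distribution of their covariates using an energy distance. The sketch below computes the standard sample (V-statistic) energy distance, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, with Euclidean norms. The function name `energy_distance` and the toy covariate arrays are assumptions for illustration; the paper's exact measure may be scaled or defined differently.

```python
import numpy as np

def energy_distance(x, y):
    """Sample energy distance between two groups of covariate vectors.

    x has shape (n, d) and y has shape (m, d); rows are subjects.
    Hypothetical helper using the V-statistic form (self-pairs included).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

# Toy covariates (assumed): age and a binary indicator for two small arms.
arm_a = np.array([[50.0, 1.0], [62.0, 0.0], [47.0, 1.0]])
arm_b = np.array([[51.0, 1.0], [60.0, 0.0], [49.0, 0.0]])
balance = energy_distance(arm_a, arm_b)
```

Smaller values indicate that the two arms have more similar joint covariate distributions; the value is zero when both arms contain the same sample.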

arXiv:2402.01900

Distributional Off-policy Evaluation with Bellman Residual Minimization

Authors: Sungee Hong, Zhengling Qi, Raymond K. W. Wong

Abstract: We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.

Submitted 12 March, 2025; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.10010

A global kernel estimator for partially linear varying coefficient additive hazards models

Authors: Hoi Min Ng, Kin Yau Wong

Abstract: In biomedical studies, we are often interested in the association between different types of covariates and the times to disease events. Because the relationship between the covariates and event times is often complex, standard survival models that assume a linear covariate effect are inadequate. A flexible class of models for capturing complex interaction effects among types of covariates is the varying coefficient models, where the effects of a type of covariates can be modified by another type of covariates. In this paper, we study kernel-based estimation methods for varying coefficient additive hazards models. Unlike many existing kernel-based methods that use a local neighborhood of subjects for the estimation of the varying coefficient function, we propose a novel global approach that is generally more efficient. We establish theoretical properties of the proposed estimators and demonstrate their superior performance compared with existing local methods through large-scale simulation studies. To illustrate the proposed method, we provide an application to a motivating cancer genomic study.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 27 pages

MSC Class: 62N02

arXiv:2401.04723

Spatio-temporal data fusion for the analysis of in situ and remote sensing data using the INLA-SPDE approach

Authors: Shiyu He, Samuel W. K. Wong

Abstract: We propose a Bayesian hierarchical model to address the challenge of spatial misalignment in spatio-temporal data obtained from in situ and satellite sources. The model is fit using the INLA-SPDE approach, which provides efficient computation. Our methodology combines the different data sources in a "fusion" model via the construction of projection matrices in both spatial and temporal domains. Through simulation studies, we demonstrate that the fusion model has superior performance in prediction accuracy across space and time compared to standalone "in situ" and "satellite" models based on only in situ or satellite data, respectively. The fusion model also generally outperforms the standalone models in terms of parameter inference. Such a modeling approach is motivated by environmental problems, and our specific focus is on the analysis and prediction of harmful algae bloom (HAB) events, where the convention is to conduct separate analyses based on either in situ samples or satellite images. A real data analysis shows that the proposed model is a necessary step towards a unified characterization of bloom dynamics and identifying the key drivers of HAB events.

Submitted 9 January, 2024; originally announced January 2024.

Comments: 23 pages, 7 figures

arXiv:2312.13044

Particle Gibbs for Likelihood-Free Inference of State Space Models with Application to Stochastic Volatility

Authors: Zhaoran Hou, Samuel W. K. Wong

Abstract: State space models (SSMs) are widely used to describe dynamic systems. However, when the likelihood of the observations is intractable, parameter inference for SSMs cannot be easily carried out using standard Markov chain Monte Carlo or sequential Monte Carlo methods. In this paper, we propose a particle Gibbs sampler as a general strategy to handle SSMs with intractable likelihoods in the approximate Bayesian computation (ABC) setting. The proposed sampler incorporates a conditional auxiliary particle filter, which can help mitigate the weight degeneracy often encountered in ABC. To illustrate the methodology, we focus on a classic stochastic volatility model (SVM) used in finance and econometrics for analyzing and interpreting volatility. Simulation studies demonstrate the accuracy of our sampler for SVM parameter inference, compared to existing particle Gibbs samplers based on the conditional bootstrap filter. As a real data application, we apply the proposed sampler for fitting an SVM to S&P 500 Index time-series data during the 2008 financial crisis.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 23 pages

arXiv:2311.03497

Understanding the Impact of Seasonal Climate Change on Canada's Economy by Region and Sector

Authors: Shiyu He, Trang Bui, Yuying Huang, Wenling Zhang, Jie Jian, Samuel W. K. Wong, Tony S. Wirjanto

Abstract: To assess the impact of climate change on the Canadian economy, we investigate and model the relationship between seasonal climate variables and economic growth across provinces and economic sectors. We further provide projections of climate change impacts up to the year 2050, taking into account the diverse climate change patterns and economic conditions across Canada. Our results indicate that rising Fall temperature anomalies have a notable adverse impact on Canadian economic growth. Province-wide, Saskatchewan and Manitoba are anticipated to experience the most substantial declines, whereas British Columbia and the Maritime provinces will be less impacted. Industry-wide, Mining is projected to see the greatest benefits, while Agriculture and Manufacturing are projected to have the most significant downturns. The disparities of climate change effects between provinces and industries highlight the need for governments to tailor their policies accordingly, and offer targeted assistance to regions and industries that are particularly vulnerable in the face of climate change. Targeted approaches to climate change mitigation are likely to be more effective than one-size-fits-all policies for the whole economy.

Submitted 6 November, 2023; originally announced November 2023.

Comments: 25 pages, 7 figures

arXiv:2310.20537

Directed Cyclic Graph for Causal Discovery from Multivariate Functional Data

Authors: Saptarshi Roy, Raymond K. W. Wong, Yang Ni

Abstract: Discovering causal relationships using multivariate functional data has received a significant amount of attention very recently. In this article, we introduce a functional linear structural equation model for causal structure learning when the underlying graph involving the multivariate functions may have cycles. To enhance interpretability, our model involves a low-dimensional causal embedded space such that all the relevant causal information in the multivariate functional data is preserved in this lower-dimensional subspace. We prove that the proposed model is causally identifiable under standard assumptions that are often made in the causal discovery literature. To carry out inference of our model, we develop a fully Bayesian framework with suitable prior specifications and uncertainty quantification through posterior summaries. We illustrate the superior performance of our method over existing methods in terms of causal graph estimation through extensive simulation studies. We also demonstrate the proposed method using a brain EEG dataset.

Submitted 31 October, 2023; originally announced October 2023.

Comments: 36 pages, 2 figures, 7 tables

arXiv:2310.15070 [pdf, ps, other]

Improving estimation efficiency of case-cohort study with interval-censored failure time data

Authors: Qingning Zhou, Kin Yau Wong

Abstract: The case-cohort design is a commonly used cost-effective sampling strategy for large cohort studies, where some covariates are expensive to measure or obtain. In this paper, we consider regression analysis under a case-cohort study with interval-censored failure time data, where the failure time is only known to fall within an interval instead of being exactly observed. A common approach to analyze data from a case-cohort study is the inverse probability weighting approach, where only subjects in the case-cohort sample are used in estimation, and the subjects are weighted based on the probability of inclusion into the case-cohort sample. This approach, though consistent, is generally inefficient as it does not incorporate information outside the case-cohort sample. To improve efficiency, we first develop a sieve maximum weighted likelihood estimator under the Cox model based on the case-cohort sample, and then propose a procedure to update this estimator by using information in the full cohort. We show that the update estimator is consistent, asymptotically normal, and more efficient than the original estimator. The proposed method can flexibly incorporate auxiliary variables to further improve estimation efficiency. We employ a weighted bootstrap procedure for variance estimation. Simulation results indicate that the proposed method works well in practical situations. A real study on diabetes is provided for illustration.

Submitted 23 October, 2023; originally announced October 2023.

Comments: 19 pages, 3 tables

arXiv:2310.07801 [pdf, other]

Trajectory-aware Principal Manifold Framework for Data Augmentation and Image Generation

Authors: Elvis Han Cui, Bingbin Li, Yanan Li, Weng Kee Wong, Donghui Wang

Abstract: Data augmentation for deep learning benefits model training, image transformation, medical imaging analysis and many other fields. Many existing methods generate new samples from a parametric distribution, like the Gaussian, with little attention to generate samples along the data manifold in either the input or feature space. In this paper, we verify that there are theoretical and practical advantages of using the principal manifold hidden in the feature space than the Gaussian distribution. We then propose a novel trajectory-aware principal manifold framework to restore the manifold backbone and generate samples along a specific trajectory. On top of the autoencoder architecture, we further introduce an intrinsic dimension regularization term to make the manifold more compact and enable few-shot image generation. Experimental results show that the novel framework is able to extract more compact manifold representation, improve classification accuracy and generate smooth transformation among few samples.

Submitted 30 July, 2023; originally announced October 2023.

Comments: 20 figures

arXiv:2309.08039 [pdf, other]

Flexible Functional Treatment Effect Estimation

Authors: Jiayi Wang, Raymond K. W. Wong, Xiaoke Zhang, Kwun Chuen Gary Chan

Abstract: We study treatment effect estimation with functional treatments where the average potential outcome functional is a function of functions, in contrast to continuous treatment effect estimation where the target is a function of real numbers. By considering a flexible scalar-on-function marginal structural model, a weight-modified kernel ridge regression (WMKRR) is adopted for estimation. The weights are constructed by directly minimizing the uniform balancing error resulting from a decomposition of the WMKRR estimator, instead of being estimated under a particular treatment selection model. Despite the complex structure of the uniform balancing error derived under WMKRR, finite-dimensional convex algorithms can be applied to efficiently solve for the proposed weights thanks to a representer theorem. The optimal convergence rate is shown to be attainable by the proposed WMKRR estimator without any smoothness assumption on the true weight function. Corresponding empirical performance is demonstrated by a simulation study and a real data application.

Submitted 12 November, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2306.16909 [pdf, other]

A network-based regression approach for identifying subject-specific driver mutations

Authors: Kin Yau Wong, Donglin Zeng, D. Y. Lin

Abstract: In cancer genomics, it is of great importance to distinguish driver mutations, which contribute to cancer progression, from causally neutral passenger mutations. We propose a random-effect regression approach to estimate the effects of mutations on the expressions of genes in tumor samples, where the estimation is assisted by a prespecified gene network. The model allows the mutation effects to vary across subjects. We develop a subject-specific mutation score to quantify the effect of a mutation on the expressions of its downstream genes, so mutations with large scores can be prioritized as drivers. We demonstrate the usefulness of the proposed methods by simulation studies and provide an application to a breast cancer genomics study.

Submitted 29 June, 2023; originally announced June 2023.

Comments: 23 pages; 9 figures

arXiv:2304.02127 [pdf, other]

A Bayesian Collocation Integral Method for Parameter Estimation in Ordinary Differential Equations

Authors: Mingwei Xu, Samuel W. K. Wong, Peijun Sang

Abstract: Inferring the parameters of ordinary differential equations (ODEs) from noisy observations is an important problem in many scientific fields. Currently, most parameter estimation methods that bypass numerical integration tend to rely on basis functions or Gaussian processes to approximate the ODE solution and its derivatives. Due to the sensitivity of the ODE solution to its derivatives, these methods can be hindered by estimation error, especially when only sparse time-course observations are available. We present a Bayesian collocation framework that operates on the integrated form of the ODEs and also avoids the expensive use of numerical solvers. Our methodology has the capability to handle general nonlinear ODE systems. We demonstrate the accuracy of the proposed method through simulation studies, where the estimated parameters and recovered system trajectories are compared with other recent methods. A real data example is also provided.

Submitted 23 October, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

arXiv:2302.08439 [pdf, other]

doi 10.1080/00401706.2023.2197471

Bayesian Nonlinear Tensor Regression with Functional Fused Elastic Net Prior

Authors: Shuoli Chen, Kejun He, Shiyuan He, Yang Ni, Raymond K. W. Wong

Abstract: Tensor regression methods have been widely used to predict a scalar response from covariates in the form of a multiway array. In many applications, the regions of tensor covariates used for prediction are often spatially connected with unknown shapes and discontinuous jumps on the boundaries. Moreover, the relationship between the response and the tensor covariates can be nonlinear. In this article, we develop a nonlinear Bayesian tensor additive regression model to accommodate such spatial structure. A functional fused elastic net prior is proposed over the additive component functions to comprehensively model the nonlinearity and spatial smoothness, detect the discontinuous jumps, and simultaneously identify the active regions. The great flexibility and interpretability of the proposed method against the alternatives are demonstrated by a simulation study and an analysis on facial feature data.

Submitted 16 February, 2023; originally announced February 2023.

Journal ref: Technometrics, 65:4, 524-536 (2023)

arXiv:2301.12540 [pdf, other]

Implicit Regularization for Group Sparsity

Authors: Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, Raymond K. W. Wong

Abstract: We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.

Submitted 29 January, 2023; originally announced January 2023.

Comments: accepted by ICLR 2023

arXiv:2301.12302 [pdf, other]

A Kriging Metamodel with Adaptive Sampling for Seismic Evaluation of Podium Buildings

Authors: Yuying Huang, Zhiyong Chen, Samuel W. K. Wong

Abstract: In this paper, nonlinear time-history dynamic analyses of selected earthquake ground motions are conducted on designated wood-frame podium buildings and the resulting inter-story drifts are analyzed. We aim to construct a reliable region where performance-based seismic design criteria are met, such that a two-step analysis procedure can be used with high confidence. We develop a kriging metamodel with tailored adaptive sampling methods to achieve this goal in a computationally efficient manner. The input variables we consider are the normalized stiffness ratio and the normalized mass ratio of the podium building. We took a six-story wood frame built upon a one-story concrete podium as a case study for our methodology, where our results indicate that the two-step analysis procedure may be used with high confidence if its normalized stiffness ratio is at least 38 and its normalized mass ratio is between 0.5 and 1.5.

Submitted 28 January, 2023; originally announced January 2023.

Comments: 14 pages, 2 figures

arXiv:2210.14216 [pdf, other]

Estimating Boltzmann Averages for Protein Structural Quantities Using Sequential Monte Carlo

Authors: Zhaoran Hou, Samuel W. K. Wong

Abstract: Sequential Monte Carlo (SMC) methods are widely used to draw samples from intractable target distributions. Particle degeneracy can hinder the use of SMC when the target distribution is highly constrained or multimodal. As a motivating application, we consider the problem of sampling protein structures from the Boltzmann distribution. This paper proposes a general SMC method that propagates multiple descendants for each particle, followed by resampling to maintain the desired number of particles. Simulation studies demonstrate the efficacy of the method for tackling the protein sampling problem. As a real data example, we use our method to estimate the number of atomic contacts for a key segment of the SARS-CoV-2 viral spike protein.

Submitted 25 October, 2022; originally announced October 2022.

Comments: 20 pages

arXiv:2210.13323 [pdf, other]

A Comparative Study of Compartmental Models for COVID-19 Transmission in Ontario, Canada

Authors: Yuxuan Zhao, Samuel W. K. Wong

Abstract: The number of confirmed COVID-19 cases reached over 1.3 million in Ontario, Canada by June 4, 2022. The continued spread of the virus underlying COVID-19 has been spurred by the emergence of variants since the initial outbreak in December, 2019. Much attention has thus been devoted to tracking and modelling the transmission of COVID-19. Compartmental models are commonly used to mimic epidemic transmission mechanisms and are easy to understand. Their performance in real-world settings, however, needs to be more thoroughly assessed. In this comparative study, we examine five compartmental models -- four existing ones and an extended model that we propose -- and analyze their ability to describe COVID-19 transmission in Ontario from January 2022 to June 2022.

Submitted 24 October, 2022; originally announced October 2022.

Comments: 26 pages, 8 figures

arXiv:2206.12891 [pdf, other]

Hierarchical nuclear norm penalization for multi-view data

Authors: Sangyoon Yi, Raymond K. W. Wong, Irina Gaynanova

Abstract: The prevalence of data collected on the same set of samples from multiple sources (i.e., multi-view data) has prompted significant development of data integration methods based on low-rank matrix factorizations. These methods decompose signal matrices from each view into the sum of shared and individual structures, which are further used for dimension reduction, exploratory analyses, and quantifying associations across views. However, existing methods have limitations in modeling partially-shared structures due to either too restrictive models, or restrictive identifiability conditions. To address these challenges, we formulate a new model for partially-shared signals based on grouping the views into so-called hierarchical levels. The proposed hierarchy leads us to introduce a new penalty, hierarchical nuclear norm (HNN), for signal estimation. In contrast to existing methods, HNN penalization avoids scores and loadings factorization of the signals and leads to a convex optimization problem, which we solve using a dual forward-backward algorithm. We propose a simple refitting procedure to adjust the penalization bias and develop an adapted version of bi-cross-validation for selecting tuning parameters. Extensive simulation studies and analysis of the genotype-tissue expression data demonstrate the advantages of our method over existing alternatives.

Submitted 26 June, 2022; originally announced June 2022.

Comments: 39 pages, 10 figures, 3 tables

arXiv:2203.12913 [pdf, other]

k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations

Authors: Ka Wong, Praveen Paritosh

Abstract: Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ones. However, many natural language processing (NLP) applications that rely on aggregate ratings only report the reliability of individual ratings, which is the incorrect unit of analysis. In these instances, the data reliability is under-reported, and a proposed k-rater reliability (kRR) should be used as the correct data reliability for aggregated datasets. It is a multi-rater generalization of inter-rater reliability (IRR). We conducted two replications of the WordSim-353 benchmark, and present empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353. These methods produce very similar results. We hope this discussion will nudge researchers to report kRR in addition to IRR.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2203.06066 [pdf, other]

MAGI: A Package for Inference of Dynamic Systems from Noisy and Sparse Data via Manifold-constrained Gaussian Processes

Authors: Samuel W. K. Wong, Shihao Yang, S. C. Kou

Abstract: This article presents the MAGI software package for the inference of dynamic systems. The focus of MAGI is on dynamics modeled by nonlinear ordinary differential equations with unknown parameters. While such models are widely used in science and engineering, the available experimental data for parameter estimation may be noisy and sparse. Furthermore, some system components may be entirely unobserved. MAGI solves this inference problem with the help of manifold-constrained Gaussian processes within a Bayesian statistical framework, whereas unobserved components have posed a significant challenge for existing software. We use several realistic examples to illustrate the functionality of MAGI. The user may choose to use the package in any of the R, MATLAB, and Python environments.

Submitted 16 October, 2023; v1 submitted 11 March, 2022; originally announced March 2022.

Comments: 47 pages, 10 figures

arXiv:2201.07775 [pdf, other]

Monte Carlo sampling of flexible protein structures: an application to the SARS-CoV-2 omicron variant

Authors: Samuel W. K. Wong

Abstract: Proteins can exhibit dynamic structural flexibility as they carry out their functions, especially in binding regions that interact with other molecules. For the key SARS-CoV-2 spike protein that facilitates COVID-19 infection, studies have previously identified several such highly flexible regions with therapeutic importance. However, protein structures available from the Protein Data Bank are presented as static snapshots that may not adequately depict this flexibility, and furthermore these cannot keep pace with new mutations and variants. In this paper we present a sequential Monte Carlo method for broadly sampling the 3-D conformational space of protein structure, according to the Boltzmann distribution of a given energy function. Our approach is distinct from previous sampling methods that focus on finding the lowest-energy conformation for predicting a single stable structure. We exemplify our method on the SARS-CoV-2 omicron variant as an application of timely interest. Our results identify sequence positions 495-508 as a key region where omicron mutations have the most impact on the space of possible conformations, which coincides with the findings of other preliminary studies on the binding properties of the omicron variant.

Submitted 4 February, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: 20 pages, 4 figures

arXiv:2201.03464 [pdf, other]

Knots and their effect on the tensile strength of lumber: a case study

Authors: Shuxian Fan, Samuel W. K. Wong, James V. Zidek

Abstract: When assessing the strength of sawn lumber for use in engineering applications, the sizes and locations of knots are an important consideration. Knots are the most common visual characteristics of lumber, that result from the growth of tree branches. Large individual knots, as well as clusters of distinct knots, are known to have strength-reducing effects. However, industry grading rules that govern knots are informed by subjective judgment to some extent, particularly the spatial interaction of knots and their relationship with lumber strength. This case study reports the results of an experiment that investigated and modelled the strength-reducing effects of knots on a sample of Douglas Fir lumber. Experimental data were obtained by taking scans of lumber surfaces and applying tensile strength testing. The modelling approach presented incorporates all relevant knot information in a Bayesian framework, thereby contributing a more refined way of managing the quality of manufactured lumber.

Submitted 14 February, 2023; v1 submitted 10 January, 2022; originally announced January 2022.

Comments: 20 pages, 4 figures

arXiv:2111.14623 [pdf, other]

doi 10.1109/TBDATA.2021.3103458

An Overview of Healthcare Data Analytics With Applications to the COVID-19 Pandemic

Authors: Zhe Fei, Yevgen Ryeznik, Oleksandr Sverdlov, Chee Wei Tan, Weng Kee Wong

Abstract: In the era of big data, standard analysis tools may be inadequate for making inference and there is a growing need for more efficient and innovative ways to collect, process, analyze and interpret the massive and complex data. We provide an overview of challenges in big data problems and describe how innovative analytical methods, machine learning tools and metaheuristics can tackle general healthcare problems with a focus on the current pandemic. In particular, we give applications of modern digital technology, statistical methods, data platforms and data integration systems to improve diagnosis and treatment of diseases in clinical research and novel epidemiologic tools to tackle infection source problems, such as finding Patient Zero in the spread of epidemics. We make the case that analyzing and interpreting big data is a very challenging task that requires a multi-disciplinary effort to continuously create more effective methodologies and powerful tools to transfer data information into knowledge that enables informed decision making.

Submitted 25 November, 2021; originally announced November 2021.

Journal ref: IEEE TRANSACTIONS ON BIG DATA, 12 August 2021

arXiv:2110.11896 [pdf, other]

Multimodel Bayesian Analysis of Load Duration Effects in Lumber Reliability

Authors: Yunfeng Yang, Martin Lysy, Samuel W. K. Wong

Abstract: This paper evaluates the reliability of lumber, accounting for the duration-of-load (DOL) effect under different load profiles based on a multimodel Bayesian approach. Three individual DOL models previously used for reliability assessment are considered: the US model, the Canadian model, and the Gamma process model. Procedures for stochastic generation of residential, snow, and wind loads are also described. We propose Bayesian model-averaging (BMA) as a method for combining the reliability estimates of individual models under a given load profile that coherently accounts for statistical uncertainty in the choice of model and parameter values. The method is applied to the analysis of a Hemlock experimental dataset, where the BMA results are illustrated via estimated reliability indices together with 95% interval bands.

Submitted 22 October, 2021; originally announced October 2021.

Comments: 15 pages, 2 figures

arXiv:2110.06115 [pdf]

doi 10.1097/EDE.0000000000001453

Evaluating the Impact of State-Level Public Masking Mandates on New COVID-19 Cases and Deaths in the United States: A Demonstration of the Causal Roadmap

Authors: Angus K. Wong, Laura B. Balzer

Abstract: At a national level, we sought to investigate the effect of public masking mandates on COVID-19 in Fall 2020. Specifically, we aimed to evaluate how the relative growth of COVID-19 cases and deaths would have differed if all states had issued a mandate to mask in public by September 1, 2020 versus if all states had delayed issuing such a mandate. To do so, we applied the Causal Roadmap, a formal framework for causal and statistical inference. The outcome was defined as the state-specific relative increase in cumulative cases and in cumulative deaths {21, 30, 45, 60} days after September 1. Despite the natural experiment in state-level masking policies, the causal effect of interest was not identifiable. Nonetheless, we specified the target statistical parameter as the adjusted rate ratio (aRR): the expected outcome with early implementation divided by the expected outcome with delayed implementation, after adjusting for state-level confounders. To minimize strong estimation assumptions, primary analyses used targeted maximum likelihood estimation (TMLE) with Super Learner. After 60 days and at a national level, early implementation was associated with a 9% reduction in new COVID-19 cases (aRR: 0.91; 95%CI: 0.88-0.95) and a 16% reduction in new COVID-19 deaths (aRR: 0.84; 95%CI: 0.76-0.93). Although lack of identifiability prohibited causal interpretations, application of the Causal Roadmap facilitated estimation and inference of statistical associations, providing timely answers to pressing questions in the COVID-19 response.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 34 total pages (including supp materials)

Journal ref: Epidemiology, December 8, 2021

arXiv:2109.04640 [pdf, other]

Projected State-action Balancing Weights for Offline Reinforcement Learning

Authors: Jiayi Wang, Zhengling Qi, Raymond K. W. Wong

Abstract: Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.

Submitted 9 June, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2108.05574 [pdf, other]

Implicit Sparse Regularization: The Impact of Depth and Early Stopping

Authors: Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, Raymond K. W. Wong

Abstract: In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases. We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery with sufficiently small initialization and step size. In particular, we show that increasing depth enlarges the scale of working initialization and the early-stopping window so that this implicit sparse regularization effect is more likely to take place.

Submitted 26 October, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: 32 pages, accepted by NeurIPS 2021. arXiv admin note: text overlap with arXiv:1909.05122 by other authors

arXiv:2106.07393 [pdf, other]

Cross-replication Reliability -- An Empirical Approach to Interpreting Inter-rater Reliability

Authors: Ka Wong, Praveen Paritosh, Lora Aroyo

Abstract: We present a new approach to interpreting inter-rater reliability (IRR) that is empirical and contextualized. It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen's kappa. We call this approach the xRR framework. We open-source a replication dataset of 4 million human judgements of facial expressions and analyze it with the proposed framework. We argue this framework can be used to measure the quality of crowdsourced datasets.

Submitted 11 June, 2021; originally announced June 2021.

arXiv:2106.05850 [pdf, other]

Matrix Completion with Model-free Weighting

Authors: Jiayi Wang, Raymond K. W. Wong, Xiaojun Mao, Kwun Chuen Gary Chan

Abstract: In this paper, we propose a novel method for matrix completion under general non-uniform missing structures. By controlling an upper bound of a novel balancing error, we construct weights that can actively adjust for the non-uniformity in the empirical risk without explicitly modeling the observation probabilities, and can be computed efficiently via convex optimization. The recovered matrix based on the proposed weighted empirical risk enjoys appealing theoretical guarantees. In particular, the proposed method achieves a stronger guarantee than existing work in terms of the scaling with respect to the observation probabilities, under asymptotically heterogeneous missing settings (where entry-wise observation probabilities can be of different orders). These settings can be regarded as a better theoretical model of missing patterns with highly varying probabilities. We also provide a new minimax lower bound under a class of heterogeneous settings. Numerical experiments are also provided to demonstrate the effectiveness of the proposed method.

Submitted 9 June, 2021; originally announced June 2021.

Comments: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

arXiv:2105.14647 [pdf, ps, other]

Orthogonal Subsampling for Big Data Linear Regression

Authors: Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu

Abstract: The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.

Submitted 30 May, 2021; originally announced May 2021.

arXiv:2105.08835 [pdf, ps, other]

Conformational variability of loops in the SARS-CoV-2 spike protein

Authors: Samuel W. K. Wong, Zongjun Liu

Abstract: The SARS-CoV-2 spike (S) protein facilitates viral infection, and has been the focus of many structure determination efforts. Its flexible loop regions are known to be involved in protein binding and may adopt multiple conformations. This paper identifies the S protein loops and studies their conformational variability based on the available Protein Data Bank (PDB) structures. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable with multiple substantively distinct conformations based on a cluster analysis. Loop modeling methods were then applied to the S protein loop targets, and the prediction accuracies discussed in relation to the characteristics of the conformational clusters identified. Loops with multiple conformations were found to be challenging to model based on a single structural template.

Submitted 13 October, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 24 pages

arXiv:2104.10878 [pdf, other]

doi 10.3934/math.2022376

Comparing regional and provincial-wide COVID-19 models with physical distancing in British Columbia

Authors: Geoffrey McGregor, Jennifer Tippett, Andy T. S. Wan, Mengxiao Wang, Samuel W. K. Wong

Abstract: We study the effects of physical distancing measures for the spread of COVID-19 in regional areas within British Columbia, using the reported cases of the five provincial Health Authorities. Building on the Bayesian epidemiological model of Anderson et al. (2020), we propose a hierarchical regional Bayesian model with time-varying regional parameters between March and December of 2020. In the absence of COVID-19 variants and vaccinations during this period, we examine the regionalized basic reproduction number, modelled prevalence, relative reduction in contact due to physical distancing, and proportion of anticipated cases that have been tested and reported. We observe significant differences between the regional and provincial-wide models and demonstrate the hierarchical regional model can better estimate regional prevalence, especially in rural regions. These results indicate that it can be useful to apply similar regional models to other parts of Canada or other countries.

Submitted 13 November, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: 35 pages, 16 figures

Journal ref: AIMS Mathematics, 2022, 7(4): 6743-6778

arXiv:2104.10041 [pdf, other]

Particle swarm optimization in constrained maximum likelihood estimation a case study

Authors: Elvis Cui, Dongyuan Song, Weng Kee Wong

Abstract: The aim of this paper is to apply two types of particle swarm optimization (PSO), global best and local best PSO, to a constrained maximum likelihood estimation problem in pseudotime analysis, a sub-field of bioinformatics. The results show that particle swarm optimization is extremely useful and efficient when the optimization problem is non-differentiable and non-convex, so that an analytical solution cannot be derived and gradient-based methods cannot be applied.

Submitted 9 April, 2021; originally announced April 2021.

Comments: 11 pages, 7 figures

arXiv:2103.03437 [pdf, other]

Estimation of Partially Conditional Average Treatment Effect by Hybrid Kernel-covariate Balancing

Authors: Jiayi Wang, Raymond K. W. Wong, Shu Yang, Kwun Chuen Gary Chan

Abstract: We study nonparametric estimation for the partially conditional average treatment effect, defined as the treatment effect function over an interested subset of confounders. We propose a hybrid kernel weighting estimator where the weights aim to control the balancing error of any function of the confounders from a reproducing kernel Hilbert space after kernel smoothing over the subset of interested variables. In addition, we present an augmented version of our estimator which can incorporate estimations of outcome mean functions. Based on the representer theorem, gradient-based algorithms can be applied for solving the corresponding infinite-dimensional optimization problem. Asymptotic properties are studied without any smoothness assumptions for propensity score function or the need of data splitting, relaxing certain existing stringent assumptions. The numerical performance of the proposed estimator is demonstrated by a simulation study and an application to the effect of a mother's smoking on a baby's birth weight conditioned on the mother's age.

Submitted 4 March, 2021; originally announced March 2021.

Comments: 19 pages, 2 figures

arXiv:2101.02304 [pdf, other]

Statistical challenges in the analysis of sequence and structure data for the COVID-19 spike protein

Authors: Shiyu He, Samuel W. K. Wong

Abstract: As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein's 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.

Submitted 30 January, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

Comments: 21 pages, 5 figures

arXiv:2011.00442 [pdf, other]

Penalized estimation for single-index varying-coefficient models with applications to integrative genomic analysis

Authors: Hoi Min Ng, Binyan Jiang, Kin Yau Wong

Abstract: Recent technological advances have made it possible to collect high-dimensional genomic data along with clinical data on a large number of subjects. In the studies of chronic diseases such as cancer, it is of great interest to integrate clinical and genomic data to build a comprehensive understanding of the disease mechanisms. Despite extensive studies on integrative analysis, it remains an ongoing challenge to model the interaction effects between clinical and genomic variables, due to high-dimensionality of the data and heterogeneity across data types. In this paper, we propose an integrative approach that models interaction effects using a single-index varying-coefficient model, where the effects of genomic features can be modified by clinical variables. We propose a penalized approach for separate selection of main and interaction effects. We demonstrate the advantages of the proposed methods through extensive simulation studies and provide applications to a motivating cancer genomic study.

Submitted 1 November, 2020; originally announced November 2020.

Comments: 18 pages, 8 figures

arXiv:2010.13568 [pdf, other]

doi 10.1109/ACCESS.2021.3049494

CP Degeneracy in Tensor Regression

Authors: Ya Zhou, Raymond K. W. Wong, Kejun He

Abstract: Tensor linear regression is an important and useful tool for analyzing tensor data. To deal with high dimensionality, CANDECOMP/PARAFAC (CP) low-rank constraints are often imposed on the coefficient tensor parameter in the (penalized) $M$-estimation. However, we show that the corresponding optimization may not be attainable, and when this happens, the estimator is not well-defined. This is closely related to a phenomenon, called CP degeneracy, in low-rank tensor approximation problems. In this article, we provide useful results of CP degeneracy in tensor regression problems. In addition, we provide a general penalized strategy as a solution to overcome CP degeneracy. The asymptotic properties of the resulting estimation are also studied. Numerical experiments are conducted to illustrate our findings.

Submitted 22 October, 2020; originally announced October 2020.

Journal ref: IEEE Access, 9:1, 7775-7788 (2021)

arXiv:2009.11452 [pdf, ps, other]

A Wavelet-Based Independence Test for Functional Data with an Application to MEG Functional Connectivity

Authors: Rui Miao, Xiaoke Zhang, Raymond K. W. Wong

Abstract: Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to recovered functions. To ensure the compatibility between the two steps such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible, we propose to use wavelet soft-thresholding for pre-smoothing and Besov-norm-induced kernels for HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those by existing methods.

Submitted 23 September, 2020; originally announced September 2020.

arXiv:2009.07444 [pdf, other]

doi 10.1073/pnas.2020397118

Inference of dynamic systems from noisy and sparse data via manifold-constrained Gaussian processes

Authors: Shihao Yang, Samuel W. K. Wong, S. C. Kou

Abstract: Parameter estimation for nonlinear dynamic system models, represented by ordinary differential equations (ODEs), using noisy and sparse data is a vital task in many fields. We propose a fast and accurate method, MAGI (MAnifold-constrained Gaussian process Inference), for this task. MAGI uses a Gaussian process model over time-series data, explicitly conditioned on the manifold constraint that derivatives of the Gaussian process must satisfy the ODE system. By doing so, we completely bypass the need for numerical integration and achieve substantial savings in computational time. MAGI is also suitable for inference with unobserved system components, which often occur in real experiments. MAGI is distinct from existing approaches as we provide a principled statistical construction under a Bayesian framework, which incorporates the ODE system through the manifold constraint. We demonstrate the accuracy and speed of MAGI using realistic examples based on physical experiments.

Submitted 21 February, 2021; v1 submitted 15 September, 2020; originally announced September 2020.

Showing 1–50 of 99 results for author: Wong, K