State Space Model Programming in Turing.jl

Authors: Tim Hargreaves, Qing Li, Charles Knipp, Frederic Wantiez, Simon J. Godsill, Hong Ge

Abstract: State space models (SSMs) are a powerful and widely-used class of probabilistic models for analysing time-series data across various fields, from econometrics to robotics. Despite their prevalence, existing software frameworks for SSMs often lack compositionality and scalability, hindering experimentation and making it difficult to leverage advanced inference techniques. This paper introduces SSMProblems.jl and GeneralisedFilters.jl, two Julia packages within the Turing.jl ecosystem, that address this challenge by providing a consistent, composable, and general framework for defining SSMs and performing inference on them. This unified interface allows researchers to easily define a wide range of SSMs and apply various inference algorithms, including Kalman filtering, particle filtering, and combinations thereof. By promoting code reuse and modularity, our packages reduce development time and improve the reliability of SSM implementations. We prioritise scalability through efficient memory management and GPU-acceleration, ensuring that our framework can handle large-scale inference tasks.

Submitted 29 May, 2025; originally announced May 2025.

Comments: 16 pages, 6 figures, Presented at LAFI (Languages for Inference) Workshop, POPL 2025

arXiv:2505.19367 [pdf, ps, other]

Adaptive Diffusion Guidance via Stochastic Optimal Control

Authors: Iskander Azangulov, Peter Potaptchik, Qinyu Li, Eddie Aamari, George Deligiannidis, Judith Rousseau

Abstract: Guidance is a cornerstone of modern diffusion models, playing a pivotal role in conditional generation and enhancing the quality of unconditional samples. However, current approaches to guidance scheduling--determining the appropriate guidance weight--are largely heuristic and lack a solid theoretical foundation. This work addresses these limitations on two fronts. First, we provide a theoretical formalization that precisely characterizes the relationship between guidance strength and classifier confidence. Second, building on this insight, we introduce a stochastic optimal control framework that casts guidance scheduling as an adaptive optimization problem. In this formulation, guidance strength is not fixed but dynamically selected based on time, the current sample, and the conditioning class, either independently or in combination. By solving the resulting control problem, we establish a principled foundation for more effective guidance in diffusion models.

Submitted 25 May, 2025; originally announced May 2025.

arXiv:2505.12952 [pdf, ps, other]

LoD: Loss-difference OOD Detection by Intentionally Label-Noisifying Unlabeled Wild Data

Authors: Chuanxing Geng, Qifei Li, Xinrui Wang, Dong Liang, Songcan Chen, Pong C. Yuen

Abstract: Using unlabeled wild data containing both in-distribution (ID) and out-of-distribution (OOD) data to improve the safety and reliability of models has recently received increasing attention. Existing methods either design customized losses for labeled ID and unlabeled wild data then perform joint optimization, or first filter out OOD data from the latter then learn an OOD detector. While achieving varying degrees of success, two potential issues remain: (i) Labeled ID data typically dominates the learning of models, inevitably making models tend to fit OOD data as IDs; (ii) The selection of thresholds for identifying OOD data in unlabeled wild data usually faces a dilemma due to the unavailability of pure OOD samples. To address these issues, we propose a novel loss-difference OOD detection framework (LoD) by intentionally label-noisifying unlabeled wild data. Such operations not only enable labeled ID data and OOD data in unlabeled wild data to jointly dominate the models' learning but also ensure the distinguishability of the losses between ID and OOD samples in unlabeled wild data, allowing the classic clustering technique (e.g., K-means) to filter these OOD samples without requiring thresholds any longer. We also provide a theoretical foundation for LoD's viability, and extensive experiments verify its superiority.

Submitted 19 May, 2025; originally announced May 2025.

Comments: Accepted by IJCAI2025

arXiv:2504.20360 [pdf, other]

Identification and estimation of vaccine effectiveness in the test-negative design under equi-confounding

Authors: Christopher B. Boyer, Kendrick Qijun Li, Xu Shi, Eric J. Tchetgen Tchetgen

Abstract: The test-negative design (TND) is frequently used to evaluate vaccine effectiveness in real-world settings. In a TND study, individuals with similar symptoms who seek care are tested for the disease of interest, and vaccine effectiveness is estimated by comparing the vaccination history of test-positive cases and test-negative controls. Traditional approaches justify the TND by assuming either (a) receiving a test is a perfect proxy for unmeasured health-seeking behavior or (b) vaccination is unconfounded given measured covariates -- both of which may be unrealistic in practice. In this paper, we return to the original motivation for the TND and propose an alternative justification based on the assumption of odds ratio equi-confounding, where unmeasured confounders influence test-positive and test-negative individuals equivalently on the odds ratio scale. We discuss the implications of this assumption for TND design and provide alternative estimators for the marginal risk ratio among the vaccinated under equi-confounding, including estimators based on outcome modeling and inverse probability weighting as well as a semiparametric estimator that is doubly-robust. When the equi-confounding assumption does not hold, we suggest a sensitivity analysis that parameterizes the magnitude of the deviation on the odds ratio scale. We conduct a simulation study to evaluate the empirical performance of our proposed estimators under a wide range of scenarios.

Submitted 11 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.14169 [pdf, other]

Correction for nonignorable nonresponse bias in the estimation of turnout using callback data

Authors: Xinyu Li, Naiwen Ying, Kendrick Qijun Li, Xu Shi, Wang Miao

Abstract: Overestimation of turnout has long been an issue in election surveys, with nonresponse bias or voter overrepresentation regarded as one of the major sources of bias. However, the adjustment for nonignorable nonresponse bias is substantially challenging. Based on the ANES Non-Response Follow-Up Study concerning the 2020 U.S. presidential election, we investigate the role of callback data in adjusting for nonresponse bias in the estimation of turnout. Callback data are the records of contact attempts in the survey course, available in many modern large-scale surveys. We propose a stableness of resistance assumption to account for the nonignorable missingness in the outcome, which states that the impact of the missing outcome on the response propensity is stable in the first two call attempts. Under this assumption and by leveraging covariates information from the census data, we establish the identifiability and develop estimation methods for turnout, including a doubly robust estimator. Our methods produce estimates very close to the official turnout and successfully capture the trend of declining willingness to vote as response reluctance or contact difficulty increases. This work hints at the importance of adjusting for nonignorable nonresponse bias and exhibits the promise of callback data for political surveys.

Submitted 19 April, 2025; originally announced April 2025.

arXiv:2504.04547 [pdf, other]

Variational Bayesian Multiple Imputation in High-Dimensional Regression Models With Missing Responses

Authors: Qiushuang Li, Recai Yucel

Abstract: Multiple imputation has become one of the standard methods in drawing inferences in many incomplete data applications. Applications of multiple imputation in relatively more complex settings, such as high-dimensional clustered data, require specialized methods to overcome the computational burden. Using linear mixed-effects models, we develop such methods that can be applied to continuous, binary, or categorical incomplete data by employing variational Bayesian inference to sample the posterior predictive distribution of the missing data. These methods specifically target high-dimensional data and work with the spike-and-slab prior, which automatically selects the variables of importance to be in the imputation model. The individual regression computation is then incorporated into a variable-by-variable imputation algorithm. Finally, we use a calibration-based algorithm to adapt these methods to multiple imputations of categorical variables. We present a simulation study and illustrate on National Survey of Children's Health data to assess the performance of these methods in a repetitive sampling framework.

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.04539 [pdf, ps, other]

Sequential Hierarchical Regression Imputation with Variable Selection Routines

Authors: Qiushuang Li, Recai Yucel

Abstract: We aim to incorporate variable selection routines into variable-by-variable (or sequential) imputation in clustered data to achieve computational improvement in applications with large-scale health data. Specifically, we utilize variable selection routines using spike-and-slab priors within the Bayesian variable selection routine. The choice of these priors allows us to "force" variables of importance (e.g., design variables or variables known to play a role in the missingness mechanism) into the imputation models based on a class of mixed-effects models. Our ultimate goal is to improve computational speed by removing unnecessary variables. We employ Markov chain Monte Carlo techniques to sample from the implied posterior distributions for model unknowns as well as missing data. We assess the performance of our proposed methodology via simulation studies. Our results show that our proposed algorithms lead to satisfactory estimates and, in some instances, outperform some of the existing methods that are available to practitioners. We illustrate our methods using a national survey of children's health.

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.00755 [pdf, ps, other]

doi 10.1002/sim.10311

Efficient computation of high-dimensional penalized piecewise constant hazard random effects models

Authors: Hillary M. Heiling, Naim U. Rashid, Quefeng Li, Xianlu L. Peng, Jen Jen Yeh

Abstract: Identifying and characterizing relationships between treatments, exposures, or other covariates and time-to-event outcomes has great significance in a wide range of biomedical settings. In research areas such as multi-center clinical trials, recurrent events, and genetic studies, proportional hazard mixed effects models (PHMMs) are used to account for correlations observed in clusters within the data. In high dimensions, proper specification of the fixed and random effects within PHMMs is difficult and computationally complex. In this paper, we approximate the proportional hazards mixed effects model with a piecewise constant hazard mixed effects survival model. We estimate the model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We also incorporate a factor model decomposition of the random effects in order to more easily scale the variable selection method to larger dimensions. We demonstrate the utility of our method using simulations, and we apply our method to a multi-study pancreatic ductal adenocarcinoma gene expression dataset to select features important for survival.

Submitted 1 April, 2025; originally announced April 2025.

Journal ref: Statistics in Medicine 2025

arXiv:2503.24322 [pdf, other]

NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Authors: Qinyu Li, Yee Whye Teh, Razvan Pascanu

Abstract: The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

Submitted 31 March, 2025; originally announced March 2025.

arXiv:2502.13453 [pdf, other]

BISON: Bi-clustering of spatial omics data with feature selection

Authors: Bencong Zhu, Alberto Cassese, Marina Vannucci, Michele Guindani, Qiwei Li

Abstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has reshaped genomic studies by enabling high-throughput gene expression profiling while preserving spatial and morphological context. Understanding gene functions and interactions in different spatial domains is crucial, as it can enhance our comprehension of biological mechanisms, such as cancer-immune interactions and cell differentiation in various regions. It is necessary to cluster tissue regions into distinct spatial domains and identify discriminating genes that elucidate the clustering result, referred to as spatial domain-specific discriminating genes (DGs). Existing methods for identifying these genes typically rely on a two-stage approach, which can lead to the phenomenon known as double-dipping. To address the challenge, we propose a unified Bayesian latent block model that simultaneously detects a list of DGs contributing to spatial domain identification while clustering these DGs and spatial locations. The efficacy of our proposed method is validated through a series of simulation experiments, and its capability to identify DGs is demonstrated through applications to benchmark SRT datasets.

Submitted 19 February, 2025; originally announced February 2025.

arXiv:2501.03747 [pdf, other]

Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series

Authors: Yuxiao Hu, Qian Li, Dongxiao Zhang, Jinyue Yan, Yuntian Chen

Abstract: Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs' capabilities. Many methods aim to activate LLMs' capabilities based on token-level alignment but overlook LLMs' inherent strength on natural language processing -- their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment, a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which is achieved by a Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs to treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Demonstration example prompts are employed to construct Demonstration Examples based Context-Alignment (DECA) following the DSCA-GNNs framework. DECA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provides powerful prior knowledge on context.

Submitted 5 April, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

Comments: no comment

arXiv:2412.05669 [pdf, other]

Detecting outliers by clustering algorithms

Authors: Qi Li, Shuliang Wang

Abstract: Clustering and outlier detection are two important tasks in data mining. Outliers frequently interfere with clustering algorithms to determine the similarity between objects, resulting in unreliable clustering results. Currently, only a few clustering algorithms (e.g., DBSCAN) have the ability to detect outliers to eliminate interference. For other clustering algorithms, it is tedious to introduce another outlier detection task to eliminate outliers before each clustering process. Obviously, how to equip more clustering algorithms with outlier detection ability is very meaningful. Although a common strategy allows clustering algorithms to detect outliers based on the distance between objects and clusters, it is contradictory to improving the performance of clustering algorithms on the datasets with outliers. In this paper, we propose a novel outlier detection approach, called ODAR, for clustering. ODAR maps outliers and normal objects into two separated clusters by feature transformation. As a result, any clustering algorithm can detect outliers by identifying clusters. Experiments show that ODAR is robust to diverse datasets. Compared with baseline methods, the clustering algorithms achieve the best on 7 out of 10 datasets with the help of ODAR, with at least 5% improvement in accuracy.

Submitted 7 December, 2024; originally announced December 2024.

arXiv:2411.00992 [pdf, other]

Correlation of Correlation Networks: High-Order Interactions in the Topology of Brain Networks

Authors: Qiang Li, Jingyu Liu, Vince D. Calhoun

Abstract: To understand collective network behavior in the complex human brain, pairwise correlation networks alone are insufficient for capturing the high-order interactions that extend beyond pairwise interactions and play a crucial role in brain network dynamics. These interactions often reveal intricate relationships among multiple brain networks, significantly influencing cognitive processes. In this study, we explored the correlation of correlation networks and topological network analysis with resting-state fMRI to gain deeper insights into these higher-order interactions and their impact on the topology of brain networks, ultimately enhancing our understanding of brain function. We observed that the correlation of correlation networks highlighted network connections while preserving the topological structure of correlation networks. Our findings suggested that the correlation of correlation networks surpassed traditional correlation networks, showcasing considerable potential for applications in various areas of network science. Moreover, after applying topological network analysis to the correlation of correlation networks, we observed that some high-order interaction hubs predominantly occurred in primary and high-level cognitive areas, such as the visual and fronto-parietal regions. These high-order hubs played a crucial role in information integration within the human brain.

Submitted 5 November, 2024; v1 submitted 1 November, 2024; originally announced November 2024.

Comments: 4 pages, 2 figures, 1 table; Submitted to IEEE International Symposium on Biomedical Imaging (ISBI 2025)

arXiv:2411.00982 [pdf, other]

The Dynamics of Triple Interactions in Resting fMRI: Insights into Psychotic Disorders

Authors: Qiang Li, Vince D. Calhoun, Armin Iraji

Abstract: The human brain dynamically integrated and configured information to adapt to the environment. To capture these changes over time, dynamic second-order functional connectivity was typically used to capture transient brain patterns. However, dynamic second-order functional connectivity typically ignored interactions beyond pairwise relationships. To address this limitation, we utilized dynamic triple interactions to investigate multiscale network interactions in the brain. In this study, we evaluated a resting-state fMRI dataset that included individuals with psychotic disorders (PD). We first estimated dynamic triple interactions using resting-state fMRI. After clustering, we estimated cohort-specific and cohort-common states for controls (CN), schizophrenia (SZ), and schizoaffective disorder (SAD). From the cohort-specific states, we observed significant triple interactions, particularly among visual, subcortical, and somatomotor networks, as well as temporal and higher cognitive networks in SZ. In SAD, key interactions involved temporal networks in the initial state and somatomotor networks in subsequent states. From the cohort-common states, we observed that high-cognitive networks were primarily involved in SZ and SAD compared to CN. Furthermore, the most significant differences between SZ and SAD also existed in high-cognitive networks. In summary, we studied PD using dynamic triple interaction, the first time such an approach has been used to study PD. Our findings highlighted the significant potential of dynamic high-order functional connectivity, paving the way for new avenues in the study of the healthy and disordered human brain.

Submitted 5 November, 2024; v1 submitted 1 November, 2024; originally announced November 2024.

Comments: 4 pages, 3 figures; Submitted to IEEE International Symposium on Biomedical Imaging (ISBI 2025)

arXiv:2410.18076 [pdf, other]

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine

Abstract: Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.

Submitted 23 February, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

Comments: 27 pages, 19 figures

arXiv:2410.00229 [pdf, other]

Stochastic Inverse Problem: stability, regularization and Wasserstein gradient flow

Authors: Qin Li, Maria Oprea, Li Wang, Yunan Yang

Abstract: Inverse problems in physical or biological sciences often involve recovering an unknown parameter that is random. The sought-after quantity is a probability distribution of the unknown parameter, that produces data that aligns with measurements. Consequently, these problems are naturally framed as stochastic inverse problems. In this paper, we explore three aspects of this problem: direct inversion, variational formulation with regularization, and optimization via gradient flows, drawing parallels with deterministic inverse problems. A key difference from the deterministic case is the space in which we operate. Here, we work within probability space rather than Euclidean or Sobolev spaces, making tools from measure transport theory necessary for the study. Our findings reveal that the choice of metric -- both in the design of the loss function and in the optimization process -- significantly impacts the stability and properties of the optimizer.

Submitted 30 September, 2024; originally announced October 2024.

arXiv:2409.16308 [pdf, other]

Probabilistic Spatiotemporal Modeling of Day-Ahead Wind Power Generation with Input-Warped Gaussian Processes

Authors: Qiqi Li, Mike Ludkovski

Abstract: We design a Gaussian Process (GP) spatiotemporal model to capture features of day-ahead wind power forecasts. We work with hourly-scale day-ahead forecasts across hundreds of wind farm locations, with the main aim of constructing a fully probabilistic joint model across space and hours of the day. To this end, we design a separable space-time kernel, implementing both temporal and spatial input warping to capture the non-stationarity in the covariance of wind power. We conduct synthetic experiments to validate our choice of the spatial kernel and to demonstrate the effectiveness of warping in addressing nonstationarity. The second half of the paper is devoted to a detailed case study using a realistic, fully calibrated dataset representing wind farms in the ERCOT region of Texas.

Submitted 10 September, 2024; originally announced September 2024.

Comments: 29 pages, 12 figures

arXiv:2408.14410 [pdf, other]

Generalized Bayesian nonparametric clustering framework for high-dimensional spatial omics data

Authors: Bencong Zhu, Guanyu Hu, Xiaodan Fan, Qiwei Li

Abstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has transformed genomic research by enabling high-throughput gene expression profiling while preserving spatial context. Identifying spatial domains within SRT data is a critical task, with numerous computational approaches currently available. However, most existing methods rely on a multi-stage process that involves ad-hoc dimension reduction techniques to manage the high dimensionality of SRT data. These low-dimensional embeddings are then subjected to model-based or distance-based clustering methods. Additionally, many approaches depend on arbitrarily specifying the number of clusters (i.e., spatial domains), which can result in information loss and suboptimal downstream analysis. To address these limitations, we propose a novel Bayesian nonparametric mixture of factor analysis (BNPMFA) model, which incorporates a Markov random field-constrained Gibbs-type prior for partitioning high-dimensional spatial omics data. This new prior effectively integrates the spatial constraints inherent in SRT data while simultaneously inferring cluster membership and determining the optimal number of spatial domains. We have established the theoretical identifiability of cluster membership within this framework. The efficacy of our proposed approach is demonstrated through realistic simulations and applications to two SRT datasets. Our results show that the BNPMFA model not only surpasses state-of-the-art methods in clustering accuracy and estimating the number of clusters but also offers novel insights for identifying cellular regions within tissue samples.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2407.20288 [pdf, other]

Supervised Learning based Method for Condition Monitoring of Overhead Line Insulators using Leakage Current Measurement

Authors: Mile Mitrovic, Dmitry Titov, Klim Volkhov, Irina Lukicheva, Andrey Kudryavzev, Petr Vorobev, Qi Li, Vladimir Terzija

Abstract: As a new practical and economical solution to the aging problem of overhead line (OHL) assets, the technical policies of most power grid companies in the world have gradually transitioned from scheduled preventive maintenance to a risk-based approach in asset management. Even though the accumulation of contamination is predictable to a certain degree, there are currently no effective ways to identify the risk of insulator flashover in order to plan replacement. This paper presents a novel machine learning (ML) based method for estimating the flashover probability of the cup-and-pin glass insulator string. The proposed method is based on the Extreme Gradient Boosting (XGBoost) supervised ML model, in which the leakage current (LC) features and applied voltage are used as the inputs. The established model can estimate the critical flashover voltage (U50%) for various designs of OHL insulators with different voltage levels. The proposed method is also able to accurately determine the condition of the insulator strings and instruct asset management engineers to take appropriate actions.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 10 pages, 9 figures

arXiv:2407.12996 [pdf, other]

Sharpness-diversity tradeoff: improving flat ensembles with SharpBalance

Authors: Haiquan Lu, Xiaotian Liu, Yefan Zhou, Qunli Li, Kurt Keutzer, Michael W. Mahoney, Yujun Yan, Huanrui Yang, Yaoqing Yang

Abstract: Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and out-of-distribution (OOD) data. We discover a trade-off between sharpness and diversity: minimizing the sharpness in the loss landscape tends to diminish the diversity of individual members within the ensemble, adversely affecting the ensemble's improvement. The trade-off is justified through our theoretical analysis and verified empirically through extensive experiments. To address the issue of reduced diversity, we introduce SharpBalance, a novel training approach that balances sharpness and diversity within ensembles. Theoretically, we show that our training strategy achieves a better sharpness-diversity trade-off. Empirically, we conducted comprehensive evaluations in various data sets (CIFAR-10, CIFAR-100, TinyImageNet) and showed that SharpBalance not only effectively improves the sharpness-diversity trade-off, but also significantly improves ensemble performance in ID and OOD scenarios.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2406.18189 [pdf, other]

Functional knockoffs selection with applications to functional data analysis in high dimensions

Authors: Xinghao Qiao, Mingya Long, Qizhai Li

Abstract: The knockoffs framework is a recently proposed, powerful approach that effectively controls the false discovery rate (FDR) for variable selection. However, none of the existing knockoff solutions are directly suited to handle multivariate or high-dimensional functional data, which have become increasingly prevalent in various scientific applications. In this paper, we propose a novel functional model-X knockoffs selection framework tailored to sparse high-dimensional functional models, and show that our proposal can achieve effective FDR control for any sample size. Furthermore, we illustrate the proposed functional model-X knockoffs selection procedure along with the associated theoretical guarantees for both FDR control and asymptotic power using examples of commonly adopted functional linear additive regression models and the functional graphical model. In the construction of functional knockoffs, we integrate essential components including the correlation operator matrix, the Karhunen-Loève expansion, and semidefinite programming, and develop executable algorithms. We demonstrate the superiority of our proposed methods over the competitors through both extensive simulations and the analysis of two brain imaging datasets.

Submitted 27 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.08209 [pdf, other]

Forward-Euler time-discretization for Wasserstein gradient flows can be wrong

Authors: Yewei Xu, Qin Li

Abstract: In this note, we examine the forward-Euler discretization for simulating Wasserstein gradient flows. We provide two counter-examples showcasing the failure of this discretization even for a simple case where the energy functional is defined as the KL divergence against some nicely structured probability densities. A simple explanation of this failure is also discussed.

Submitted 12 June, 2024; originally announced June 2024.

MSC Class: 65M12

arXiv:2405.17079 [pdf, other]

Learning with User-Level Local Differential Privacy

Authors: Puning Zhao, Li Shen, Rongfei Fan, Qingming Li, Huiwen Wu, Jiafei Wu, Zhe Liu

Abstract: User-level privacy is important in distributed systems. Previous research primarily focuses on the central model, while the local models have received much less attention. Under the central model, user-level DP is strictly stronger than the item-level one. However, under the local model, the relationship between user-level and item-level LDP becomes more complex, thus the analysis is crucially different. In this paper, we first analyze the mean estimation problem and then apply it to stochastic optimization, classification, and regression. In particular, we propose adaptive strategies to achieve optimal performance at all privacy levels. Moreover, we also obtain information-theoretic lower bounds, which show that the proposed methods are minimax optimal up to logarithmic factors. Unlike the central DP model, where user-level DP always leads to slower convergence, our result shows that under the local model, the convergence rates are nearly the same between user-level and item-level cases for distributions with bounded support. For heavy-tailed distributions, the user-level rate is even faster than the item-level one.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2404.09194 [pdf, other]

Bayesian modeling of co-occurrence microbial interaction networks

Authors: Tejasv Bedi, Bencong Zhu, Michael L. Neugent, Kevin C. Lutz, Nicole J. De Nisco, Qiwei Li

Abstract: The human body hosts microbiomes associated with the development and prevention of several diseases. These microbial organisms form several complex interactions that are informative to the scientific community for explaining disease progression and prevention. Contrary to the traditional view of the microbiome as a singular, assortative network, we introduce a novel statistical approach using a weighted stochastic infinite block model to analyze the complex community structures within microbial co-occurrence interaction networks. Our model defines connections between microbial taxa using a novel semi-parametric rank-based correlation method on their transformed relative abundances within a fully connected network framework. Employing a Bayesian nonparametric approach, the proposed model effectively clusters taxa into distinct communities while estimating the number of communities. The posterior summary of the taxa community membership is obtained based on the posterior probability matrix, which could naturally solve the label switching problem. Through simulation studies and real-world application to microbiome data from postmenopausal patients with recurrent urinary tract infections, we demonstrate that our method has superior clustering accuracy over alternative approaches. This advancement provides a more nuanced understanding of microbiome organization, with significant implications for disease research.

Submitted 14 April, 2024; originally announced April 2024.

Comments: 25 pages

arXiv:2403.17670 [pdf, other]

A family of Chatterjee's correlation coefficients and their properties

Authors: Muhong Gao, Qizhai Li

Abstract: Quantifying the strength of functional dependence between random scalars $X$ and $Y$ is an important statistical problem. While many existing correlation coefficients excel in identifying linear or monotone functional dependence, they fall short in capturing general non-monotone functional relationships. In response, we propose a family of correlation coefficients $\xi^{(h,F)}_n$, characterized by a continuous bivariate function $h$ and a cdf $F$. By offering a range of selections for $h$ and $F$, $\xi^{(h,F)}_n$ encompasses a diverse class of novel correlation coefficients, while also incorporating Chatterjee's correlation coefficient (Chatterjee, 2021) as a special case. We prove that $\xi^{(h,F)}_n$ converges almost surely to a deterministic limit $\xi^{(h,F)}$ as the sample size $n$ approaches infinity. In addition, under appropriate conditions imposed on $h$ and $F$, the limit $\xi^{(h,F)}$ satisfies three appealing properties: (P1) it lies in the range $[0,1]$; (P2) it equals 1 if and only if $Y$ is a measurable function of $X$; and (P3) it equals 0 if and only if $Y$ is independent of $X$. As demonstrated by our numerical experiments, our proposals provide practitioners with a variety of options to choose the most suitable correlation coefficient tailored to their specific practical needs.

Submitted 26 March, 2024; originally announced March 2024.

Comments: 27 pages, 4 figures

MSC Class: 62H20; 62G05
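
The special case named in the abstract above, Chatterjee's (2021) coefficient, can be sketched in a few lines. This is only a minimal illustration of the original no-ties estimator $\xi_n = 1 - 3\sum_{i=1}^{n-1}|r_{i+1}-r_i|/(n^2-1)$, where $r_i$ is the rank of the $Y$ value paired with the $i$-th smallest $X$; it is not the generalized family $\xi^{(h,F)}_n$ from the paper, and the function name `chatterjee_xi` is a hypothetical choice made for this example.

```python
def chatterjee_xi(x, y):
    """Chatterjee's (2021) rank correlation, assuming no ties in x or y.

    Illustrative sketch only; `chatterjee_xi` is a hypothetical name.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length >= 2")
    # Order the pairs by x, keeping the corresponding y values.
    y_by_x = [yi for _, yi in sorted(zip(x, y), key=lambda p: p[0])]
    # rank[i] = rank of y_by_x[i] among all y values (1 = smallest).
    order = sorted(range(n), key=lambda i: y_by_x[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(rank[i + 1] - rank[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)
```

For a strictly monotone relationship on $n$ points the rank differences are all 1, so the estimate equals $1 - 3/(n+1)$, which approaches 1 as $n$ grows.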

arXiv:2402.15515 [pdf]

Feasibility of Identifying Factors Related to Alzheimer's Disease and Related Dementia in Real-World Data

Authors: Aokun Chen, Qian Li, Yu Huang, Yongqiu Li, Yu-neng Chuang, Xia Hu, Serena Guo, Yonghui Wu, Yi Guo, Jiang Bian

Abstract: A comprehensive view of factors associated with AD/ADRD will significantly aid in studies to develop new treatments for AD/ADRD and identify high-risk populations and patients for prevention efforts. In our study, we summarized the risk factors for AD/ADRD by reviewing existing meta-analyses and review articles on risk and preventive factors for AD/ADRD. In total, we extracted 477 risk factors in 10 categories from 537 studies. We constructed an interactive knowledge map to disseminate our study results. Most of the risk factors are accessible from structured Electronic Health Records (EHRs), and clinical narratives show promise as information sources. However, evaluating genomic risk factors using real-world data (RWD) remains a challenge, as genetic testing for AD/ADRD is still not a common practice and is poorly documented in both structured and unstructured EHRs. Considering the constantly evolving research on AD/ADRD risk factors, literature mining via NLP methods offers a solution to automatically update our knowledge map.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.09259 [pdf, other]

doi 10.1137/23M1615425

Mitigating distribution shift in machine learning-augmented hybrid simulation

Authors: Jiaxi Zhao, Qianxiao Li

Abstract: We study the problem of distribution shift generally arising in machine-learning augmented hybrid simulation, where parts of simulation algorithms are replaced by data-driven surrogates. We first establish a mathematical framework to understand the structure of machine-learning augmented hybrid simulation problems, and the cause and effect of the associated distribution shift. We show correlations between distribution shift and simulation error both numerically and theoretically. Then, we propose a simple methodology based on a tangent-space regularized estimator to control the distribution shift, thereby improving the long-term accuracy of the simulation results. In the linear dynamics case, we provide a thorough theoretical analysis to quantify the effectiveness of the proposed method. Moreover, we conduct several numerical experiments, including simulating a partially known reaction-diffusion equation and solving Navier-Stokes equations using the projection method with a data-driven pressure solver. In all cases, we observe marked improvements in simulation accuracy under the proposed method, especially for systems with high degrees of distribution shift, such as those with relatively strong non-linear reaction mechanisms, or flows at large Reynolds numbers.

Submitted 17 January, 2024; originally announced January 2024.

MSC Class: 68T99; 65M15; 37M05

arXiv:2401.04856 [pdf, other]

A Good Score Does not Lead to A Good Generative Model

Authors: Sixu Li, Shi Chen, Qin Li

Abstract: Score-based Generative Models (SGMs) are a leading method in generative modeling, renowned for their ability to generate high-quality samples from complex, high-dimensional data distributions. The method enjoys empirical success and is supported by rigorous theoretical convergence properties. In particular, it has been shown that SGMs can generate samples from a distribution that is close to the ground-truth if the underlying score function is learned well, suggesting the success of SGM as a generative model. We provide a counter-example in this paper. Through the sample complexity argument, we provide one specific setting where the score function is learned well. Yet, SGMs in this setting can only output samples that are Gaussian blurring of training data points, mimicking the effects of kernel density estimation. This finding resonates with a series of recent findings revealing that SGMs can demonstrate a strong memorization effect and fail to generate.

Submitted 27 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

arXiv:2401.00521 [pdf, other]

Multi-spatial Multi-temporal Air Quality Forecasting with Integrated Monitoring and Reanalysis Data

Authors: Yuxiao Hu, Qian Li, Xiaodan Shi, Jinyue Yan, Yuntian Chen

Abstract: Accurate air quality forecasting is crucial for public health, environmental monitoring and protection, and urban planning. However, existing methods fail to effectively utilize multi-scale information, both spatially and temporally. Spatially, there is a lack of integration between individual monitoring stations and city-wide scales. Temporally, the periodic nature of air quality variations is often overlooked or inadequately considered. To address these limitations, we present a novel Multi-spatial Multi-temporal air quality forecasting method based on Graph Convolutional Networks and Gated Recurrent Units (M2G2), bridging the gap in air quality forecasting across spatial and temporal scales. The proposed framework consists of two modules: Multi-scale Spatial GCN (MS-GCN) for spatial information fusion and Multi-scale Temporal GRU (MT-GRU) for temporal information integration. In the spatial dimension, the MS-GCN module employs a bidirectional learnable structure and a residual structure, enabling comprehensive information exchange between individual monitoring stations and the city-scale graph. Regarding the temporal dimension, the MT-GRU module adaptively combines information from different temporal scales through parallel hidden states. Leveraging meteorological indicators and four air quality indicators, we present comprehensive comparative analyses and ablation experiments, showcasing the higher accuracy of M2G2 in comparison to nine currently available advanced approaches across all aspects. The improvements of M2G2 over the second-best method in RMSE at the 24h/48h/72h horizons are as follows: PM2.5: (7.72%, 6.67%, 10.45%); PM10: (6.43%, 5.68%, 7.73%); NO2: (5.07%, 7.76%, 16.60%); O3: (6.46%, 6.86%, 9.79%). Furthermore, we demonstrate the effectiveness of each module of M2G2 through an ablation study.

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.08670 [pdf, other]

Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation

Authors: Tao Hu, Honglong Zhang, Fan Zeng, Min Du, XiangKun Du, Yue Zheng, Quanqi Li, Mengran Zhang, Dan Yang, Jihao Wu

Abstract: In the field of intracity freight transportation, changes in order volume are significantly influenced by temporal and spatial factors. When building subsidy and pricing strategies, predicting the causal effects of these strategies on order volume is crucial. In the process of calculating causal effects, confounding variables can have an impact. Traditional methods to control confounding variables handle data from a holistic perspective, which cannot ensure the precision of causal effects in specific temporal and spatial dimensions. However, temporal and spatial dimensions are extremely critical in the logistics field, and this limitation may directly affect the precision of subsidy and pricing strategies. To address these issues, this study proposes a technique based on flexible temporal-spatial grid partitioning. Building on the flexible grid partitioning technique, we further propose a continuous entropy balancing method in the temporal-spatial domain, which we name TS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continue Treatments). The method proposed in this paper has been tested on two simulation datasets and two real datasets, achieving excellent performance on all of them. After applying the TS-EBCT method to the intracity freight transportation field, the prediction accuracy of the causal effect has been significantly improved, bringing good business benefits to the company's subsidy and pricing strategies.

Submitted 18 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 10 pages

arXiv:2312.08324 [pdf, other]

Bayesian Nonparametric Clustering with Feature Selection for Spatially Resolved Transcriptomics Data

Authors: Bencong Zhu, Guanyu Hu, Yang Xie, Lin Xu, Xiaodan Fan, Qiwei Li

Abstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has reshaped genomic studies by enabling high-throughput gene expression profiling while preserving spatial and morphological context. Nevertheless, there are inherent challenges associated with these new high-dimensional spatial data, such as zero-inflation, over-dispersion, and heterogeneity. These challenges pose obstacles to effective clustering, which is a fundamental problem in SRT data analysis. Current computational approaches often rely on heuristic data preprocessing and arbitrary cluster number prespecification, leading to considerable information loss and consequently, suboptimal downstream analysis. In response to these challenges, we introduce BNPSpace, a novel Bayesian nonparametric spatial clustering framework that directly models SRT count data. BNPSpace facilitates the partitioning of the whole spatial domain, which is characterized by substantial heterogeneity, into homogeneous spatial domains with similar molecular characteristics while identifying a parsimonious set of discriminating genes among different spatial domains. Moreover, BNPSpace incorporates spatial information through a Markov random field prior model, encouraging a smooth and biologically meaningful partition pattern.

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.07067 [pdf, other]

Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training

Authors: Qian Li, Yuxiao Hu, Yinpeng Dong, Dongxiao Zhang, Yuntian Chen

Abstract: Adversarial training is often formulated as a min-max problem; however, concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model, i.e., previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as "hiders", which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We require the model to prevent hiders when defending against adversarial examples in order to improve accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training, we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders, effectively combining the optimization directions of standard adversarial training and hider prevention. Furthermore, we introduce an adaptive weighting mechanism that facilitates the model in adaptively adjusting its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method through extensive experiments, which show that HFAT can provide higher robustness and accuracy.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.03967 [pdf, other]

Test-negative designs with various reasons for testing: statistical bias and solution

Authors: Mengxin Yu, Tom Hongyi Liu, Kendrick Qijun Li, Nicholas Jewell, Eric Tchetgen Tchetgen, Dylan Small, Xu Shi, Bingkai Wang

Abstract: Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases when randomized trials are not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical demonstrations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, mandatory screening, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. The performance of our proposed method is demonstrated through simulation studies.

Submitted 26 April, 2025; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.05067 [pdf, other]

Accelerating Exploration with Unlabeled Prior Data

Authors: Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine

Abstract: Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.

Submitted 20 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 25 pages, 16 figures, 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2310.08867 [pdf]

A Survey of Methods for Handling Disk Data Imbalance

Authors: Shuangshuang Yuan, Peng Wu, Yuehui Chen, Qiang Li

Abstract: Class imbalance exists in many classification problems, and since the data is designed for accuracy, imbalance in data classes can lead to classification challenges with a few classes having higher misclassification costs. The Backblaze dataset, a widely used dataset related to hard discs, has a small amount of failure data and a large amount of health data, which exhibits a serious class imbalance. This paper provides a comprehensive overview of research in the field of imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithmic-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. Additionally, the challenges of unbalanced data classification are discussed, along with strategies to address them. It is convenient for researchers to choose the appropriate method according to their needs.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2308.14671 [pdf, other]

A generalized Bayesian stochastic block model for microbiome community detection

Authors: Kevin C. Lutz, Michael L. Neugent, Tejasv Bedi, Nicole J. De Nisco, Qiwei Li

Abstract: Advances in next-generation sequencing technology have enabled the high-throughput profiling of metagenomes and accelerated the microbiome study. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co-occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high-dimensional and compositional, suffering from uneven sampling depth, over-dispersion, and zero-inflation. These characteristics often challenge the reliability of the current methods for microbiome community detection. To this end, we propose a Bayesian stochastic block model to study the microbiome co-occurrence network based on the recently developed modified centered-log ratio transformation tailored for microbiome data analysis. Our model allows us to incorporate taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred by using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non-informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women, the first time the urinary microbiome co-occurrence network structure has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2307.01389 [pdf, other]

Identification of Causal Relationship between Amyloid-beta Accumulation and Alzheimer's Disease Progression via Counterfactual Inference

Authors: Haixing Dai, Mengxuan Hu, Qing Li, Lu Zhang, Lin Zhao, Dajiang Zhu, Ibai Diez, Jorge Sepulcre, Fan Zhang, Xingyu Gao, Manhua Liu, Quanzheng Li, Sheng Li, Tianming Liu, Xiang Li

Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that begins with amyloidosis, followed by neuronal loss and deterioration in structure, function, and cognition. The accumulation of amyloid-beta in the brain, measured through 18F-florbetapir (AV45) positron emission tomography (PET) imaging, has been widely used for early diagnosis of AD. However, the relationship between amyloid-beta accumulation and AD pathophysiology remains unclear, and causal inference approaches are needed to uncover how amyloid-beta levels can impact AD development. In this paper, we propose a graph varying coefficient neural network (GVCNet) for estimating the individual treatment effect with continuous treatment levels using a graph convolutional neural network. We highlight the potential of causal inference approaches, including GVCNet, for measuring the regional causal connections between amyloid-beta accumulation and AD pathophysiology, which may serve as a robust tool for early diagnosis and tailored care.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.01675 [pdf, other]

Bayesian Segmentation Modeling of Epidemic Growth

Authors: Tejasv Bedi, Yanxun Xu, Qiwei Li

Abstract: Tracking the spread of infectious disease during a pandemic has posed a great challenge to the governments and health sectors on a global scale. To facilitate informed public health decision-making, the concerned parties usually rely on short-term daily and weekly projections generated via predictive modeling. Several deterministic and stochastic epidemiological models, including growth and compartmental models, have been proposed in the literature. These models assume that an epidemic would last over a short duration and the observed cases/deaths would attain a single peak. However, some infectious diseases, such as COVID-19, extend over a longer duration than expected. Moreover, time-varying disease transmission rates due to government interventions have made the observed data multi-modal. To address these challenges, this work proposes stochastic epidemiological models under a unified Bayesian framework augmented by a change-point detection mechanism to account for multiple peaks. The Bayesian framework allows us to incorporate prior knowledge, such as dates of influential policy changes, to predict the change-point locations precisely. We develop a trans-dimensional reversible jump Markov chain Monte Carlo algorithm to sample the posterior distributions of epidemiological parameters while estimating the number of change points and the resulting parameters. The proposed method is evaluated and compared to alternative methods in terms of change-point detection, parameter estimation, and long-term forecasting accuracy on both simulated and COVID-19 data of several major states in the United States.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.08204 [pdf, other]

glmmPen: High Dimensional Penalized Generalized Linear Mixed Models

Authors: Hillary M. Heiling, Naim U. Rashid, Quefeng Li, Joseph G. Ibrahim

Abstract: Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process since model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower-dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first to select fixed and random effects in higher dimension using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo Expectation Conditional Minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show our method has good performance in selecting both the fixed and random effects in high dimensional GLMMs.

Submitted 16 April, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

arXiv:2305.08201 [pdf, ps, other]

Efficient Computation of High-Dimensional Penalized Generalized Linear Mixed Models by Latent Factor Modeling of the Random Effects

Authors: Hillary M. Heiling, Naim U. Rashid, Quefeng Li, Xianlu L. Peng, Jen Jen Yeh, Joseph G. Ibrahim

Abstract: Modern biomedical datasets are increasingly high dimensional and exhibit complex correlation structures. Generalized Linear Mixed Models (GLMMs) have long been employed to account for such dependencies. However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches.

Submitted 16 April, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

arXiv:2304.10466 [pdf, other]

Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Authors: Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine

Abstract: Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained unclear. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform thorough empirical analysis on state-based DeepMind control suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and prior methods that lead to good performance do, in fact, control the validation TD error to be low. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on the validation TD error by utilizing any form of regularization techniques from supervised learning. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 26 pages, 18 figures, 3 tables, The International Conference on Learning Representations (ICLR) 2023

arXiv:2303.07050 [pdf, other]

Evaluation of wait time saving effectiveness of triage algorithms

Authors: Yee Lam Elim Thompson, Gary M Levine, Weijie Chen, Berkman Sahiner, Qin Li, Nicholas Petrick, Jana G Delfino, Miguel A Lago, Qian Cao, Qin Li, Frank W Samuelson

Abstract: In the past decade, Artificial Intelligence (AI) algorithms have made promising impacts to transform healthcare in all aspects. One application is to triage patients' radiological medical images based on the algorithm's binary outputs. Such AI-based prioritization software is known as computer-aided triage and notification (CADt). Their main benefit is to speed up radiological review of images with time-sensitive findings. However, as CADt devices become more common in clinical workflows, there is still a lack of quantitative methods to evaluate a device's effectiveness in saving patients' waiting times. In this paper, we present a mathematical framework based on queueing theory to calculate the average waiting time per patient image before and after a CADt device is used. We study four workflow models with multiple radiologists (servers) and priority classes for a range of AI diagnostic performance, radiologist's reading rates, and patient image (customer) arrival rates. Due to model complexity, an approximation method known as the Recursive Dimensionality Reduction technique is applied. We define a performance metric to measure the device's time-saving effectiveness. A software tool is developed to simulate clinical workflow of image review/interpretation, to verify theoretical results, and to provide confidence intervals of the performance metric we defined. It is shown quantitatively that a triage device is more effective in a busy, short-staffed setting, which is consistent with our clinical intuition and simulation results. Although this work is motivated by the need for evaluating CADt devices, the framework we present in this paper can be applied to any algorithm that prioritizes customers based on its binary outputs.

Submitted 13 March, 2023; originally announced March 2023.

arXiv:2212.09160 [pdf, other]

Stochastic Economic Dispatch Considering Demand Response and Endogenous Uncertainty

Authors: Nasrin Bayat, Qifeng Li, Joon-Hyuk Park

Abstract: This paper considers endogenous uncertainty (EnU) in the stochastic economic dispatch (SED) problem, where endogenous uncertainty means decision-dependent uncertainty. In this problem, demand response (DR) commitment is the source of the EnU. Nevertheless, EnU is not well considered in existing literature. Our first contribution is to build up an optimization model of DR-involved SED under EnU (SED-DR-EnU). This is a computationally challenging problem due to the EnU. Our second contribution is introducing a coupled learning enabled optimization algorithm which can effectively solve the proposed SED-DR-EnU problem. This strategy is tested on the IEEE 14-bus and IEEE 39-bus systems, and the results showed the importance of considering EnU in the DR-involved SED problem.

Submitted 31 May, 2023; v1 submitted 18 December, 2022; originally announced December 2022.

arXiv:2212.08771 [pdf, other]

Assign Experiment Variants at Scale in Online Controlled Experiments

Authors: Qike Li, Samir Jamkhande, Pavel Kochetkov, Pai Liu

Abstract: Online controlled experiments (A/B tests) have become the gold standard for learning the impact of new product features in technology companies. Randomization enables the inference of causality from an A/B test. The randomized assignment maps end users to experiment buckets and balances user characteristics between the groups. Therefore, experiments can attribute any outcome differences between the experiment groups to the product feature under experiment. Technology companies run A/B tests at scale -- hundreds if not thousands of A/B tests concurrently, each with millions of users. The large scale poses unique challenges to randomization. First, the randomized assignment must be fast since the experiment service receives hundreds of thousands of queries per second. Second, the variant assignments must be independent between experiments. Third, the assignment must be consistent when users revisit or an experiment enrolls more users. We present a novel assignment algorithm and statistical tests to validate the randomized assignments. Our results demonstrate that not only is this algorithm computationally fast but also satisfies the statistical requirements -- unbiased and independent.

Submitted 16 December, 2022; originally announced December 2022.

arXiv:2211.03258 [pdf, other]

doi 10.1093/mnras/stad751

Nested sampling statistical errors

Authors: Andrew Fowlie, Qiao Li, Huifang Lv, Yecheng Sun, Jia Zhang, Le Zheng

Abstract: Nested sampling (NS) is a popular algorithm for Bayesian computation. We investigate statistical errors in NS both analytically and numerically. We show two analytic results. First, we show that the leading terms in Skilling's expression using information theory match the leading terms in Keeton's expression from an analysis of moments. This approximate agreement was previously only known numerically and was somewhat mysterious. Second, we show that the uncertainty in single NS runs approximately equals the standard deviation in repeated NS runs. Whilst intuitive, this was previously taken for granted. We close by investigating our results and their assumptions in several numerical examples, including cases in which NS uncertainties increase without bound.

Submitted 6 November, 2022; originally announced November 2022.

Comments: 12 pages + appendices, 3 figures

arXiv:2210.06025 [pdf, other]

Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment

Authors: Qinmengge Li, Matthew T. Patrick, Haihan Zhang, Chachrit Khunsriraksakul, Philip E. Stuart, Johann E. Gudjonsson, Rajan Nair, James T. Elder, Dajiang J. Liu, Jian Kang, Lam C. Tsoi, Kevin He

Abstract: Polygenic risk scores (PRS) have recently received much attention for genetics risk prediction. While successful for the Caucasian population, the PRS based on the minority population suffer from small sample sizes, high dimensionality and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction by utilizing the Caucasian model for the minority population also has limited performance. In addition, due to data privacy, the individual genotype data is not accessible for either the Caucasian population or the minority population. To address these challenges, we propose a Bregman divergence-based estimation procedure to measure and optimally balance the information from different populations. The proposed method only requires the use of encrypted summary statistics and improves the PRS performance for ethnic minority groups by incorporating additional information. We provide the asymptotic consistency and weak oracle property for the proposed method. Simulations and real data analyses also show its advantages in prediction and variable selection.

Submitted 12 October, 2022; originally announced October 2022.

Comments: 35 pages, 6 figures

arXiv:2209.13779 [pdf]

doi 10.3847/1538-4365/ac9b17

Solar Flare Index Prediction Using SDO/HMI Vector Magnetic Data Products with Statistical and Machine Learning Methods

Authors: Hewei Zhang, Qin Li, Yanxing Yang, Ju Jing, Jason T. L. Wang, Haimin Wang, Zuofeng Shang

Abstract: Solar flares, especially the M- and X-class flares, are often associated with coronal mass ejections (CMEs). They are the most important sources of space weather effects, which can severely impact the near-Earth environment. Thus it is essential to forecast flares (especially the M- and X-class ones) to mitigate their destructive and hazardous consequences. Here, we introduce several statistical and Machine Learning approaches to the prediction of the AR's Flare Index (FI) that quantifies the flare productivity of an AR by taking into account the numbers of different class flares within a certain time interval. Specifically, our sample includes 563 ARs that appeared on the solar disk from May 2010 to Dec 2017. The 25 magnetic parameters, provided by the Space-weather HMI Active Region Patches (SHARP) from the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO), characterize coronal magnetic energy stored in ARs by proxy and are used as the predictors. We investigate the relationship between these SHARP parameters and the FI of ARs with a machine-learning algorithm (spline regression) and the resampling method (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise, SMOGN for short). Based on the established relationship, we are able to predict the value of FIs for a given AR within the next 1-day period. Compared with 4 other popular machine learning algorithms, our methods improve the accuracy of FI prediction, especially for large FI. In addition, we sort the importance of SHARP parameters by the Borda Count method calculated from the ranks that are rendered by 9 different machine learning methods.

Submitted 1 December, 2022; v1 submitted 27 September, 2022; originally announced September 2022.

Journal ref: The Astrophysical Journal Supplement Series (2022), Volume 263, Number 2

arXiv:2209.12388 [pdf, other]

Joint and Individual Component Regression

Authors: Peiyao Wang, Haodong Wang, Quefeng Li, Dinggang Shen, Yufeng Liu

Abstract: Multi-group data are commonly seen in practice. Such data structure consists of data from multiple groups and can be challenging to analyze due to data heterogeneity. We propose a novel Joint and Individual Component Regression (JICO) model to analyze multi-group data. In particular, our proposed model decomposes the response into shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure has the same regression coefficients across multiple groups, whereas individual structures have group-specific regression coefficients. Moreover, the choice of global and individual ranks allows our model to cover global and group-specific models as special cases. For model estimation, we formulate this framework under the representation of latent components and propose an iterative algorithm to solve for the joint and individual scores under the new representation. To construct the latent scores, we utilize the Continuum Regression (CR), which provides a unified framework that covers the Ordinary Least Squares (OLS), the Partial Least Squares (PLS), and the Principal Component Regression (PCR) as its special cases. We show that JICO attains a good balance between global and group-specific models and remains flexible due to the usage of CR. Finally, we conduct simulation studies and analysis of an Alzheimer's disease dataset to further demonstrate the effectiveness of JICO. R implementation of JICO is available online at https://github.com/peiyaow/JICO.

Submitted 25 September, 2022; originally announced September 2022.

arXiv:2208.02246 [pdf, other]

AdaCat: Adaptive Categorical Discretization for Autoregressive Models

Authors: Qiyang Li, Ajay Jain, Pieter Abbeel

Abstract: Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that the categorical distribution can easily express multiple modes and is straightforward to optimize. However, such approximation cannot express sharp changes in density without using significantly more bins, making it parameter inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Uncertainty in Artificial Intelligence (UAI) 2022; 13 pages, 4 figures

arXiv:2208.01237 [pdf, ps, other]

Doubly Robust Proximal Causal Inference under Confounded Outcome-Dependent Sampling

Authors: Kendrick Qijun Li, Xu Shi, Wang Miao, Eric Tchetgen Tchetgen

Abstract: Unmeasured confounding and selection bias are often of concern in observational studies and may invalidate a causal analysis if not appropriately accounted for. Under outcome-dependent sampling, a latent factor that has causal effects on the treatment, outcome, and sample selection process may cause both unmeasured confounding and selection bias, rendering standard causal parameters unidentifiable without additional assumptions. Under an odds ratio model for the treatment effect, Li et al. 2022 established both proximal identification and estimation of causal effects by leveraging a pair of negative control variables as proxies of latent factors at the source of both confounding and selection bias. However, their approach relies exclusively on the existence and correct specification of a so-called treatment confounding bridge function, a model that restricts the treatment assignment mechanism. In this article, we propose doubly robust estimation under the odds ratio model with respect to two nuisance functions -- a treatment confounding bridge function and an outcome confounding bridge function that restricts the outcome law, such that our estimator is consistent and asymptotically normal if either bridge function model is correctly specified, without knowing which one is. Thus, our proposed doubly robust estimator is potentially more robust than that of Li et al. 2022.
Our simulations confirm that the proposed proximal estimators of an odds ratio causal effect can adequately account for both residual confounding and selection bias under stated conditions with well-calibrated confidence intervals in a wide range of scenarios, where standard methods generally fail to be consistent. In addition, the proposed doubly robust estimator is consistent if at least one confounding bridge function is correctly specified.

Submitted 2 August, 2022; originally announced August 2022.

Comments: 43 pages, 1 figure

Showing 1–50 of 199 results for author: LI, Q