-
Multivariate regression with missing response data for modelling regional DNA methylation QTLs
Authors:
Shomoita Alam,
Yixiao Zeng,
Sasha Bernatsky,
Marie Hudson,
Inés Colmegna,
David A. Stephens,
Celia M. T. Greenwood,
Archer Y. Yang
Abstract:
Identifying genetic regulators of DNA methylation (mQTLs) with multivariate models enhances statistical power, but is challenged by missing data from bisulfite sequencing. Standard imputation-based methods can introduce bias, limiting reliable inference. We propose \texttt{missoNet}, a novel convex estimation framework that jointly estimates regression coefficients and the precision matrix from data with missing responses. By using unbiased surrogate estimators, our three-stage procedure avoids imputation while simultaneously performing variable selection and learning the conditional dependence structure among responses. We establish theoretical error bounds, and our simulations demonstrate that \texttt{missoNet} consistently outperforms existing methods in both prediction and sparsity recovery. In a real-world mQTL analysis of the CARTaGENE cohort, \texttt{missoNet} achieved superior predictive accuracy and false-discovery control on a held-out validation set, identifying known and credible novel genetic associations. The method offers a robust, efficient, and theoretically grounded tool for genomic analyses, and is available as an R package.
Submitted 8 July, 2025;
originally announced July 2025.
-
Testing for large-dimensional covariance matrix under differential privacy
Authors:
Shiwei Sang,
Yicheng Zeng,
Xuehu Zhu,
Shurong Zheng
Abstract:
The increasing prevalence of high-dimensional data across various applications has raised significant privacy concerns in statistical inference. In this paper, we propose a differentially private integrated statistic for testing large-dimensional covariance structures, enabling accurate statistical insights while safeguarding privacy. First, we analyze the global sensitivity of sample eigenvalues for sub-Gaussian populations, where our method bypasses the commonly assumed boundedness of data covariates. For sufficiently large sample size, the privatized statistic guarantees privacy with high probability. Furthermore, when the ratio of dimension to sample size, $d/n \to y \in (0, \infty)$, the privatized test is asymptotically distribution-free with well-known critical values, and detects the local alternative hypotheses distinct from the null at the fastest rate of $1/\sqrt{n}$. Extensive numerical studies on synthetic and real data showcase the validity and power of our proposed method.
Submitted 2 June, 2025;
originally announced June 2025.
-
Characteristic function-based tests for spatial randomness
Authors:
Yiran Zeng,
Dale L. Zimmerman
Abstract:
We introduce a new type of test for complete spatial randomness that applies to mapped point patterns in a rectangle or a cube of any dimension. This is the first test of its kind to be based on characteristic functions and utilizes a weighted L2-distance between the empirical and uniform characteristic functions. It is simple to calculate and does not require adjusting for edge effects. An efficient algorithm is developed to find the asymptotic null distribution of the test statistic under the Cauchy weight function. In a simulation, our test shows varying sensitivity to different levels of spatial interaction depending on the scale parameter of the Cauchy weight function. Tests with different parameter values can be combined to create a Bonferroni-corrected omnibus test, which is almost always more powerful than the popular L-test and the Clark-Evans test for detecting heterogeneous and aggregated alternatives, although less powerful than the L-test for detecting regular alternatives. The simplicity of empirical characteristic function makes it straightforward to extend our test to non-rectangular or sparsely sampled point patterns.
Submitted 10 April, 2025;
originally announced April 2025.
-
Computational Efficient and Minimax Optimal Nonignorable Matrix Completion
Authors:
Yuanhong A,
Guoyu Zhang,
Yongcheng Zeng,
Bo Zhang
Abstract:
While the matrix completion problem has attracted considerable attention over the decades, few works address the nonignorable missing issue and all have their limitations. In this article, we propose a nuclear norm regularized row- and column-wise matrix U-statistic loss function for the generalized nonignorable missing mechanism, a flexible and generally applicable missing mechanism which contains both ignorable and nonignorable missing mechanism assumptions. The proposed method achieves computational efficiency comparable to the existing missing-at-random approaches, while providing the near minimax optimal statistical convergence rate guarantees for the more general nonignorable missing case. We propose an accelerated proximal gradient algorithm to solve the associated optimization problem, and characterize the interaction between algorithmic and statistical convergence. Simulations and real data analyses further support the practical utility of the proposed method.
Submitted 26 June, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Learning Counterfactual Outcomes Under Rank Preservation
Authors:
Peng Wu,
Haoxuan Li,
Chunyuan Zheng,
Yan Zeng,
Jiawei Chen,
Yang Liu,
Ruocheng Guo,
Kun Zhang
Abstract:
Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between the outcome and exogenous variable. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. We first introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the rank preservation assumption is not stronger than the homogeneity and strict monotonicity assumptions, and shows that the proposed ideal loss is convex, and the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.
Submitted 10 February, 2025;
originally announced February 2025.
-
Moving toward best practice when using propensity score weighting in survey observational studies
Authors:
Yukang Zeng,
Fan Li,
Guangyu Tong
Abstract:
Propensity score weighting is a common method for estimating treatment effects with survey data. The method is applied to minimize confounding using measured covariates that are often different between individuals in treatment and control. However, existing literature does not reach a consensus on the optimal use of survey weights for population-level inference in the propensity score weighting analysis. Under the balancing weights framework, we provided a unified solution for incorporating survey weights in both propensity score estimation and the outcome regression model. We derived estimators for different target populations, including the combined, treated, controlled, and overlap populations. We provide a unified expression of the sandwich variance estimator and demonstrate that the survey-weighted estimator is asymptotically normal, as established through the theory of M-estimators. Through an extensive series of simulation studies, we examined the performance of our derived estimators and compared the results to those of alternative methods. We further carried out two case studies to illustrate the application of the different methods of propensity score analysis with complex survey data. We concluded with a discussion of our findings and provided practical guidelines for propensity score weighting analysis of observational data from complex surveys.
Submitted 27 January, 2025;
originally announced January 2025.
-
Exact Bounds of Spearman's footrule in the Presence of Missing Data with Applications to Independence Testing
Authors:
Yijin Zeng,
Niall M. Adams,
Dean A. Bodenham
Abstract:
This work studies exact bounds of Spearman's footrule between two partially observed $n$-dimensional distinct real-valued vectors $X$ and $Y$. The lower bound is obtained by sequentially constructing imputations of the partially observed vectors, each with a non-increasing value of Spearman's footrule. The upper bound is found by first considering the set of all possible values of Spearman's footrule for imputations of $X$ and $Y$, and then the size of this set is gradually reduced using several constraints. Algorithms with computational complexities $O(n^2)$ and $O(n^3)$ are provided for computing the lower and upper bound of Spearman's footrule for $X$ and $Y$, respectively. As an application of the bounds, we propose a novel two-sample independence testing method for data with missing values. Improving on all existing approaches, our method controls the Type I error under arbitrary missingness. Simulation results demonstrate our method has good power, typically when the proportion of pairs containing missing data is below $15\%$.
Submitted 20 January, 2025;
originally announced January 2025.
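As a reading aid for the entry above, the following is a minimal sketch of Spearman's footrule on fully observed data, that is, the sum of absolute rank differences between two vectors with distinct values. It does not implement the paper's bounds under missing data; the function names `ranks` and `spearman_footrule` are illustrative and not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks for a sequence of distinct real values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_footrule(x, y):
    """Sum of absolute differences between the ranks of x and y.

    Assumes both vectors are fully observed, have equal length,
    and contain distinct values (no ties).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) for a, b in zip(ranks(x), ranks(y)))
```

With identical orderings the statistic is zero, and it grows as the two rankings disagree.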
-
Reliable Imputed-Sample Assisted Vertical Federated Learning
Authors:
Yaopei Zeng,
Lei Liu,
Shaoguo Liu,
Hongjian Dou,
Baoyuan Wu,
Li Liu
Abstract:
Vertical Federated Learning (VFL) is a well-known FL variant that enables multiple parties to collaboratively train a model without sharing their raw data. Existing VFL approaches focus on overlapping samples among different parties, while their performance is constrained by the limited number of these samples, leaving numerous non-overlapping samples unexplored. Some previous work has explored techniques for imputing missing values in samples, but often without adequate attention to the quality of the imputed samples. To address this issue, we propose a Reliable Imputed-Sample Assisted (RISA) VFL framework to effectively exploit non-overlapping samples by selecting reliable imputed samples for training VFL models. Specifically, after imputing non-overlapping samples, we introduce evidence theory to estimate the uncertainty of imputed samples, and only samples with low uncertainty are selected. In this way, high-quality non-overlapping samples are utilized to improve VFL model. Experiments on two widely used datasets demonstrate the significant performance gains achieved by the RISA, especially with the limited overlapping samples, e.g., a 48% accuracy gain on CIFAR-10 with only 1% overlapping samples.
Submitted 10 January, 2025;
originally announced January 2025.
-
Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables
Authors:
Zheng Li,
Feng Xie,
Xichen Guo,
Yan Zeng,
Hao Zhang,
Zhi Geng
Abstract:
Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science. A key component of this task is selecting an appropriate set of covariates for confounding adjustment to avoid bias. Most existing methods for covariate selection often assume the absence of latent variables and rely on learning the global network structure among variables. However, identifying the global structure can be unnecessary and inefficient, especially when our primary interest lies in estimating the effect of a treatment variable on an outcome variable. To address this limitation, we propose a novel local learning approach for covariate selection in nonparametric causal effect estimation, which accounts for the presence of latent variables. Our approach leverages testable independence and dependence relationships among observed variables to identify a valid adjustment set for a target causal relationship, ensuring both soundness and completeness under standard assumptions. We validate the effectiveness of our algorithm through extensive experiments on both synthetic and real-world data.
Submitted 19 May, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Testability of Instrumental Variables in Additive Nonlinear, Non-Constant Effects Models
Authors:
Xichen Guo,
Zheng Li,
Biwei Huang,
Yan Zeng,
Zhi Geng,
Feng Xie
Abstract:
We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, in certain conditions, AIT conditions are necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. Experimental results on both synthetic and three different real-world datasets show the effectiveness of our proposed condition.
Submitted 18 November, 2024;
originally announced November 2024.
-
LLM Embeddings Improve Test-time Adaptation to Tabular $Y|X$-Shifts
Authors:
Yibo Zeng,
Jiashuo Liu,
Henry Lam,
Hongseok Namkoong
Abstract:
For tabular datasets, the change in the relationship between the label and covariates ($Y|X$-shifts) is common due to missing variables (a.k.a. confounders). Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate $Y|X$-shifts, and propose to leverage the prior world knowledge in LLMs by serializing (write down) the tabular data to encode it. We find LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be well adapted/finetuned to the target domain even using 32 labeled observations. Our finding is based on a comprehensive and systematic study consisting of 7650 source-target pairs and benchmark against 261,000 model configurations trained by 22 algorithms. Our observation holds when ablating the size of accessible target data and different adaptation strategies. The code is available at https://github.com/namkoong-lab/LLM-Tabular-Shifts.
Submitted 9 October, 2024;
originally announced October 2024.
-
Deep Uncertainty-Based Explore for Index Construction and Retrieval in Recommendation System
Authors:
Xin Jiang,
Kaiqiang Wang,
Yinlong Wang,
Fengchang Lv,
Taiyang Peng,
Shuai Yang,
Xianteng Wu,
Pengye Zhang,
Shuo Yuan,
Yifan Zeng
Abstract:
In recommendation systems, the relevance and novelty of the final results are determined by a cascade system of Matching -> Ranking -> Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, typical matching algorithms have not adequately addressed relevance and novelty simultaneously. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating long-tail items (e.g., due to insufficient training samples). The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score inferred by the model. Experimental results demonstrate that UICR improves novelty without sacrificing relevance in real-world industrial production environments and on multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrate the effectiveness of the proposed algorithm.
Submitted 5 August, 2024; v1 submitted 21 July, 2024;
originally announced August 2024.
-
Identification and Estimation of the Bi-Directional MR with Some Invalid Instruments
Authors:
Feng Xie,
Zhen Yao,
Lin Xie,
Yan Zeng,
Zhi Geng
Abstract:
We consider the challenging problem of estimating causal effects from purely observational data in the bi-directional Mendelian randomization (MR), where some invalid instruments, as well as unmeasured confounding, usually exist. To address this problem, most existing methods attempt to find proper valid instrumental variables (IVs) for the target causal effect by expert knowledge or by assuming that the causal model is a one-directional MR model. As such, in this paper, we first theoretically investigate the identification of the bi-directional MR from observational data. In particular, we provide necessary and sufficient conditions under which valid IV sets are correctly identified such that the bi-directional MR model is identifiable, including the causal directions of a pair of phenotypes (i.e., the treatment and outcome). Moreover, based on the identification theory, we develop a cluster fusion-like method to discover valid IV sets and estimate the causal effects of interest. We theoretically demonstrate the correctness of the proposed algorithm. Experimental results show the effectiveness of our method for estimating causal effects in bi-directional MR.
Submitted 12 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
MMD Two-sample Testing in the Presence of Arbitrarily Missing Data
Authors:
Yijin Zeng,
Niall M. Adams,
Dean A. Bodenham
Abstract:
In many real-world applications, it is common that a proportion of the data may be missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) which accounts for missing data in both samples, without making assumptions about the missingness mechanism. Our approach is based on deriving the mathematically precise bounds of the MMD test statistic after accounting for all possible missing values. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data where data may be arbitrarily missing. Simulation results show that our method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of our approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error.
Submitted 24 May, 2024;
originally announced May 2024.
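For orientation on the entry above, here is a minimal sketch of the standard biased (V-statistic) estimate of the squared Maximum Mean Discrepancy for complete univariate samples. The Gaussian kernel and the `bandwidth` parameter are assumptions chosen for illustration; the paper's bounds over all possible missing values are not implemented here.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between two fully observed samples.

    Computes mean k(x, x') + mean k(y, y') - 2 * mean k(x, y),
    where each mean runs over all pairs, including identical ones.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two samples coincide and approaches 2 when they are far apart relative to the bandwidth.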
-
Policy Learning for Balancing Short-Term and Long-Term Rewards
Authors:
Peng Wu,
Ziyu Shen,
Feng Xie,
Zhongyao Wang,
Chunchen Liu,
Yan Zeng
Abstract:
Empirical researchers and decision-makers spanning various domains frequently seek profound insights into the long-term impacts of interventions. While the significance of long-term outcomes is undeniable, an overemphasis on them may inadvertently overshadow short-term gains. Motivated by this, this paper formalizes a new framework for learning the optimal policy that effectively balances both long-term and short-term rewards, where some long-term outcomes are allowed to be missing. In particular, we first present the identifiability of both rewards under mild assumptions. Next, we deduce the semiparametric efficiency bounds, along with the consistency and asymptotic normality of their estimators. We also reveal that short-term outcomes, if associated, contribute to improving the estimator of the long-term reward. Based on the proposed estimators, we develop a principled policy learning approach and further derive the convergence rates of regret and estimation errors associated with the learned policy. Extensive experiments are conducted to validate the effectiveness of the proposed method, demonstrating its practical applicability.
Submitted 15 September, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
On two-sample testing for data with arbitrarily missing values
Authors:
Yijin Zeng,
Niall M. Adams,
Dean A. Bodenham
Abstract:
We develop a new rank-based approach for univariate two-sample testing in the presence of missing data which makes no assumptions about the missingness mechanism. This approach is a theoretical extension of the Wilcoxon-Mann-Whitney test that controls the Type I error by providing exact bounds for the test statistic after accounting for the number of missing values. Greater statistical power is shown when the method is extended to account for a bounded domain. Furthermore, exact bounds are provided on the proportions of data that can be missing in the two samples while yielding a significant result. Simulations demonstrate that our method has good power, typically for cases of $10\%$ to $20\%$ missing data, while standard imputation approaches fail to control the Type I error. We illustrate our method on complex clinical trial data in which patients' withdrawal from the trial leads to missing values.
Submitted 22 March, 2024;
originally announced March 2024.
-
Where to serve and return in Badminton Men's Double?
Authors:
Xuelin Zhu,
Yu Sun,
Yumin Zeng,
Cong Xu
Abstract:
This study aims to analyze the service and return landing areas in badminton men's double, based on data extracted from 20 badminton matches. We find that most services land near the center-line, while returns tend to land in the crossing areas of the serving team's court. Using generalized logit models, we are able to predict the return landing area based on features of the service and return round. We find that the direction of the service and the footwork and grip of the receiver could indicate his intended return landing area. Additionally, we discover that servers tend to intercept in specific areas based on their serving position. Our results offer valuable insights into the strategic decisions made by players in the service and return of a badminton rally.
Submitted 27 October, 2023;
originally announced October 2023.
-
The Expressive Power of Low-Rank Adaptation
Authors:
Yuchen Zeng,
Kangwook Lee
Abstract:
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
Submitted 17 March, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
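To make the rank condition in the entry above concrete, the sketch below evaluates the inequality exactly as stated in the abstract, LoRA-rank $\geq$ (width of $f$) $\times$ (depth of $\overline{f}$) / (depth of $f$), and returns the smallest integer rank that satisfies it. The function name and arguments are hypothetical, and any further conditions in the paper itself are not modeled.

```python
def lora_rank_threshold(width_f, depth_f, depth_target):
    """Smallest integer r with r >= width_f * depth_target / depth_f.

    width_f:      width of the model f being adapted
    depth_f:      depth of f
    depth_target: depth of the (smaller) target model
    Integer ceiling division avoids floating-point rounding.
    """
    if width_f <= 0 or depth_f <= 0 or depth_target <= 0:
        raise ValueError("width and depths must be positive integers")
    return (width_f * depth_target + depth_f - 1) // depth_f
```

For example, a width-64, depth-8 network adapted to a depth-2 target gives a threshold of 16 under this reading of the inequality.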
-
Ensemble Active Learning by Contextual Bandits for AI Incubation in Manufacturing
Authors:
Yingyan Zeng,
Xiaoyu Chen,
Ran Jin
Abstract:
It is challenging but important to save annotation efforts in streaming data acquisition to maintain data quality for supervised learning base learners. We propose an ensemble active learning method to actively acquire samples for annotation by contextual bandits, which enforces the exploration-exploitation balance and leads to improved AI modeling performance.
Submitted 10 October, 2023; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Identifying vital nodes through augmented random walks on higher-order networks
Authors:
Yujie Zeng,
Yiming Huang,
Xiao-Long Ren,
Linyuan Lü
Abstract:
Empirical networks possess considerable heterogeneity of node connections, resulting in a small portion of nodes playing crucial roles in network structure and function. Yet, how to characterize nodes' influence and identify vital nodes is by far still unclear in the study of networks with higher-order interactions. In this paper, we introduce a multi-order graph obtained by incorporating the higher-order bipartite graph and the classical pairwise graph, and propose a Higher-order Augmented Random Walk (HoRW) model through random walking on it. This representation preserves as much information about the higher-interacting network as possible. The results indicate that the proposed method effectively addresses the localization problem of certain classical centralities. In contrast to random walks along pairwise interactions only, performing more walks along higher-order interactions assists in not only identifying the most important nodes but also distinguishing nodes that ranked in the middle and bottom. Our method outperforms classical centralities in identifying vital nodes and can scale to various tasks in networks, including information spread maximization and network dismantling problems. The proposed higher-order representation and the random walk model provide novel insights and potent tools for studying higher-order mechanisms and functionality.
Submitted 3 December, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
-
LAVA: Data Valuation without Pre-Specified Learning Algorithms
Authors:
Hoang Anh Just,
Feiyang Kang,
Jiachen T. Wang,
Yi Zeng,
Myeongseob Ko,
Ming Jin,
Ruoxi Jia
Abstract:
Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
Submitted 19 December, 2023; v1 submitted 28 April, 2023;
originally announced May 2023.
-
Sketched Ridgeless Linear Regression: The Role of Downsampling
Authors:
Xin Chen,
Yicheng Zeng,
Siyue Yang,
Qiang Sun
Abstract:
Overparametrization often helps improve the generalization performance. This paper presents a dual view of overparametrization suggesting that downsampling may also help generalize. Focusing on the proportional regime $m\asymp n \asymp p$, where $m$ represents the sketching size, $n$ is the sample size, and $p$ is the feature dimensionality, we investigate two out-of-sample prediction risks of the sketched ridgeless least square estimator. Our findings challenge conventional beliefs by showing that downsampling does not always harm generalization but can actually improve it in certain cases. We identify the optimal sketching size that minimizes out-of-sample prediction risks and demonstrate that the optimally sketched estimator exhibits stabler risk curves, eliminating the peaks of those for the full-sample estimator. To facilitate practical implementation, we propose an empirical procedure to determine the optimal sketching size. Finally, we extend our analysis to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
Submitted 13 October, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Cyclical Kernel Adaptive Metropolis
Authors:
Jianan Canal Li,
Yimeng Zeng,
Wentao Guo
Abstract:
We propose cKAM, cyclical Kernel Adaptive Metropolis, which incorporates a cyclical stepsize scheme to allow control for exploration and sampling. We show that on a crafted bimodal distribution, existing Adaptive Metropolis type algorithms would fail to converge to the true posterior distribution. We point out that this is because adaptive samplers estimate the local/global covariance structure using the past history of the chain, which leads to adaptive algorithms being trapped in a local mode. We demonstrate that cKAM encourages exploration of the posterior distribution and allows the sampler to escape from a local mode, while maintaining the high performance of adaptive methods.
Submitted 29 June, 2022; v1 submitted 29 June, 2022;
originally announced June 2022.
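To illustrate the general idea of a cyclical stepsize on a bimodal target, here is a minimal sketch. It is not cKAM: it uses a plain random-walk Metropolis sampler rather than a kernel adaptive proposal, and the cosine schedule, the mixture target, and every name and default value below are assumptions made for illustration.

```python
import math
import random


def cyclical_stepsize(k: int, cycle_len: int, max_step: float) -> float:
    """Assumed cosine schedule: starts each cycle at max_step and decays toward 0."""
    pos = (k % cycle_len) / cycle_len
    return 0.5 * max_step * (math.cos(math.pi * pos) + 1.0)


def log_target(x: float) -> float:
    """Unnormalized log-density of an equal mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis(n_steps: int, cycle_len: int = 200, max_step: float = 8.0, seed: int = 0):
    """Random-walk Metropolis whose proposal scale follows the cyclical schedule."""
    rng = random.Random(seed)
    x = -4.0  # start inside the left mode
    samples = []
    for k in range(n_steps):
        step = cyclical_stepsize(k, cycle_len, max_step)
        proposal = x + rng.gauss(0.0, step)
        diff = log_target(proposal) - log_target(x)
        # Symmetric proposal, so the acceptance probability is min(1, target ratio).
        if rng.random() < math.exp(min(0.0, diff)):
            x = proposal
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = metropolis(5000)
    print(sum(1 for d in draws if d > 0) / len(draws))  # fraction of draws in the right mode
```

Large steps at the start of each cycle give the chain a chance to jump between modes, while the small steps late in the cycle sample locally; how the paper combines this with kernel adaptation is not specified in the abstract.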
-
A study of tree-based methods and their combination
Authors:
Yinuo Zeng
Abstract:
Tree-based methods are popular machine learning techniques used in various fields. In this work, we review their foundations and a general framework, the importance sampled learning ensemble (ISLE), that accelerates their fitting process. Furthermore, we describe a model combination strategy called the adaptive regression by mixing (ARM), which is feasible for tree-based methods via ISLE. Moreover, three modified ISLEs are proposed, and their performance is evaluated on real data sets.
Submitted 29 April, 2022;
originally announced April 2022.
-
ModelPred: A Framework for Predicting Trained Model from Training Data
Authors:
Yingyan Zeng,
Jiachen T. Wang,
Si Chen,
Hoang Anh Just,
Ran Jin,
Ruoxi Jia
Abstract:
In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] as we aim for predicting the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
Submitted 23 December, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
Few-Shot Domain Adaptation For End-to-End Communication
Authors:
Jayaram Raghuram,
Yijing Zeng,
Dolores García Martí,
Rafael Ruiz Ortiz,
Somesh Jha,
Joerg Widmer,
Suman Banerjee
Abstract:
The problem of end-to-end learning of a communication system using an autoencoder -- consisting of an encoder, channel, and decoder modeled using neural networks -- has recently been shown to be a promising approach. A challenge faced in the practical adoption of this learning approach is that under changing channel conditions (e.g. a wireless link), it requires frequent retraining of the autoencoder in order to maintain a low decoding error rate. Since retraining is both time-consuming and requires a large number of samples, it becomes impractical when the channel distribution is changing quickly. We propose to address this problem using a fast and sample-efficient (few-shot) domain adaptation method that does not change the encoder and decoder networks. Different from conventional training-time unsupervised or semi-supervised domain adaptation, here we have a trained autoencoder from a source distribution that we want to adapt (at test time) to a target distribution using only a small labeled dataset and no unlabeled data. Our method focuses on a Gaussian mixture density network based channel model, and formulates its adaptation based on class and component-conditional affine transformations. The learned affine transformations are used to design an optimal input transformation at the decoder to compensate for the distribution shift, and effectively present to the decoder inputs close to the source distribution. Experiments on a real mmWave FPGA setup as well as a number of simulated distribution changes common to the wireless setting demonstrate the effectiveness of our method at adaptation using a very small number of target domain samples.
Submitted 25 July, 2022; v1 submitted 2 August, 2021;
originally announced August 2021.
-
Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization
Authors:
Yibo Zeng,
Henry Lam
Abstract:
Established approaches to obtain generalization bounds in data-driven optimization and machine learning mostly build on solutions from empirical risk minimization (ERM), which depend crucially on the functional complexity of the hypothesis class. In this paper, we present an alternate route to obtain these bounds on the solution from distributionally robust optimization (DRO), a recent data-driven optimization framework based on worst-case analysis and the notion of ambiguity set to capture statistical uncertainty. In contrast to the hypothesis class complexity in ERM, our DRO bounds depend on the ambiguity set geometry and its compatibility with the true loss function. Notably, when using statistical distances such as maximum mean discrepancy, Wasserstein distance, or $φ$-divergence in the DRO, our analysis implies generalization bounds whose dependence on the hypothesis class appears the minimal possible: The bound depends solely on the true loss function, independent of any other candidates in the hypothesis class. To our best knowledge, it is the first generalization bound of this type in the literature, and we hope our findings can open the door for a better understanding of DRO, especially its benefits on loss minimization and other machine learning applications.
Submitted 12 October, 2022; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Efficient Peer Effects Estimators with Group Effects
Authors:
Guido M. Kuersteiner,
Ingmar R. Prucha,
Ying Zeng
Abstract:
We study linear peer effects models where peers interact in groups, individuals' outcomes are linear in the group mean outcome and characteristics, and group effects are random. Our specification is motivated by the moment conditions imposed in Graham 2008. We show that these moment conditions can be cast in terms of a linear random group effects model and lead to a class of GMM estimators that are generally identified as long as there is sufficient variation in group size. We also show that our class of GMM estimators contains a Quasi Maximum Likelihood estimator (QMLE) for the random group effects model, as well as the Wald estimator of Graham 2008 and the within estimator of Lee 2007 as special cases. Our identification results extend insights in Graham 2008 that show how assumptions about random group effects as well as variation in group size can be used to overcome the reflection problem in identifying peer effects. Our QMLE and GMM estimators accommodate additional covariates and are valid in situations with a large but finite number of different group sizes or types. Because our estimators are general moment-based procedures, using instruments other than binary group indicators in estimation is straightforward. Our QMLE estimator accommodates group-level covariates in the spirit of Mundlak and Chamberlain and offers an alternative to fixed effects specifications. Monte-Carlo simulations show that the bias of the QMLE estimator decreases with the number of groups and the variation in group size, and increases with group size. We also prove the consistency and asymptotic normality of the estimator under reasonable assumptions.
Submitted 25 April, 2022; v1 submitted 10 May, 2021;
originally announced May 2021.
-
A Robust graph attention network with dynamic adjusted Graph
Authors:
Xianchen Zhou,
Yaoyun Zeng,
Hongxia Wang
Abstract:
Graph Attention Networks (GATs) are useful deep learning models for graph data. However, recent works show that the classical GAT is vulnerable to adversarial attacks: its performance degrades dramatically under slight perturbations. Therefore, how to enhance the robustness of GAT is a critical problem. Robust GAT (RoGAT) is proposed in this paper to improve the robustness of GAT based on a revision of the attention mechanism. Different from the original GAT, which uses the attention mechanism for different edges but is still sensitive to perturbation, RoGAT progressively adds an extra dynamic attention score and improves the robustness. Firstly, RoGAT revises the edge weights based on the smoothness assumption, which is quite common for ordinary graphs. Secondly, RoGAT further revises the features to suppress feature noise. Then, an extra attention score is generated from the dynamic edge weights and can be used to reduce the impact of adversarial attacks. Experiments against targeted and untargeted attacks on citation data demonstrate that RoGAT outperforms most of the recent defensive methods.
Submitted 4 August, 2022; v1 submitted 27 September, 2020;
originally announced September 2020.
-
Improving Query Efficiency of Black-box Adversarial Attack
Authors:
Yang Bai,
Yuyuan Zeng,
Yong Jiang,
Yisen Wang,
Shu-Tao Xia,
Weiwei Guo
Abstract:
Deep neural networks (DNNs) have demonstrated excellent performance on various tasks; however, they are at risk from adversarial examples that can be easily generated when the target model is accessible to an attacker (white-box setting). As plenty of machine learning models have been deployed via online services that only provide query outputs from inaccessible models (e.g. Google Cloud Vision API), black-box adversarial attacks (inaccessible target model) are a more critical security concern in practice than white-box ones. However, existing query-based black-box adversarial attacks often require excessive model queries to maintain a high attack success rate. Therefore, in order to improve query efficiency, we explore the distribution of adversarial examples around benign inputs with the help of image structure information characterized by a Neural Process, and propose a Neural Process based black-box adversarial attack (NP-Attack) in this paper. Extensive experiments show that NP-Attack could greatly decrease the query counts under the black-box setting.
Submitted 25 September, 2020; v1 submitted 24 September, 2020;
originally announced September 2020.
-
Causal Discovery with Multi-Domain LiNGAM for Latent Factors
Authors:
Yan Zeng,
Shohei Shimizu,
Ruichu Cai,
Feng Xie,
Michio Yamamoto,
Zhifeng Hao
Abstract:
Discovering causal structures among latent factors from observed data is a particularly challenging problem. Despite some efforts for this problem, existing methods focus on the single-domain data only. In this paper, we propose Multi-Domain Linear Non-Gaussian Acyclic Models for Latent Factors (MD-LiNA), where the causal structure among latent factors of interest is shared for all domains, and we provide its identification results. The model enriches the causal representation for multi-domain data. We propose an integrated two-phase algorithm to estimate the model. In particular, we first locate the latent factors and estimate the factor loading matrix. Then to uncover the causal structure among shared latent factors of interest, we derive a score function based on the characterization of independence relations between external influences and the dependence relations between multi-domain latent factors and latent factors of interest. We show that the proposed method provides locally consistent estimators. Experimental results on both synthetic and real-world data demonstrate the efficacy and robustness of our approach.
Submitted 22 April, 2022; v1 submitted 19 September, 2020;
originally announced September 2020.
-
Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models
Authors:
Shangwei Guo,
Tianwei Zhang,
Han Qiu,
Yi Zeng,
Tao Xiang,
Yang Liu
Abstract:
Watermarking has become the tendency in protecting the intellectual property of DNN models. Recent works, from the adversary's perspective, attempted to subvert watermarking mechanisms by designing watermark removal attacks. However, these attacks mainly adopted sophisticated fine-tuning techniques, which have certain fatal drawbacks or unrealistic assumptions. In this paper, we propose a novel watermark removal attack from a different perspective. Instead of just fine-tuning the watermarked models, we design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations, which can effectively and blindly destroy the memorization of watermarked models to the watermark samples. We also introduce a lightweight fine-tuning strategy to preserve the model performance. Our solution requires much less resource or knowledge about the watermarking scheme than prior works. Extensive experimental results indicate that our attack can bypass state-of-the-art watermarking solutions with very high success rates. Based on our attack, we propose watermark augmentation techniques to enhance the robustness of existing watermarks.
Submitted 17 May, 2021; v1 submitted 18 September, 2020;
originally announced September 2020.
-
Leveraging Organizational Resources to Adapt Models to New Data Modalities
Authors:
Sahaana Suri,
Raghuveer Chanda,
Neslihan Bulut,
Pradyumna Narayana,
Yemao Zeng,
Peter Bailis,
Sugato Basu,
Girija Narlikar,
Christopher Re,
Abishek Sethi
Abstract:
As applications in large organizations evolve, the machine learning (ML) models that power them must adapt the same predictive tasks to newly arising data modalities (e.g., a new video content launch in a social media application requires existing text or image models to extend to video). To solve this problem, organizations typically create ML pipelines from scratch. However, this fails to utilize the domain expertise and data they have cultivated from developing tasks for existing modalities. We demonstrate how organizational resources, in the form of aggregate statistics, knowledge bases, and existing services that operate over related tasks, enable teams to construct a common feature space that connects new and existing data modalities. This allows teams to apply methods for training data curation (e.g., weak supervision and label propagation) and model training (e.g., forms of multi-modal learning) across these different data modalities. We study how this use of organizational resources composes at production scale in over 5 classification tasks at Google, and demonstrate how it reduces the time needed to develop models for new modalities from months to weeks to days.
Submitted 23 August, 2020;
originally announced August 2020.
-
Mean field theory for deep dropout networks: digging up gradient backpropagation deeply
Authors:
Wei Huang,
Richard Yi Da Xu,
Weitao Du,
Yutian Zeng,
Yunce Zhao
Abstract:
In recent years, the mean field theory has been applied to the study of neural networks and has achieved a great deal of success. The theory has been applied to various neural network structures, including CNNs, RNNs, Residual networks, and Batch normalization. Inevitably, recent work has also covered the use of dropout. The mean field theory shows the existence of depth scales that limit the maximum depth of signal propagation and gradient backpropagation. However, the gradient backpropagation is derived under the gradient independence assumption that weights used during feed forward are drawn independently from the ones used in backpropagation. This is not how neural networks are trained in a real setting. Instead, the same weights used in a feed-forward step need to be carried over to its corresponding backpropagation. Using this realistic condition, we perform theoretical computation on linear dropout networks and a series of experiments on dropout networks. Our empirical results show an interesting phenomenon: the lengths over which gradients can backpropagate for a single input and for a pair of inputs are governed by the same depth scale. Besides, we study the relationship between the variance and mean of statistical metrics of the gradient and show an emergence of universality. Finally, we investigate the maximum trainable length for deep dropout networks through a series of experiments using MNIST and CIFAR10 and provide a more precise empirical formula for the trainable length than the original work.
Submitted 13 April, 2020; v1 submitted 19 December, 2019;
originally announced December 2019.
-
Order Determination for Spiked Models
Authors:
Yicheng Zeng,
Lixing Zhu
Abstract:
Motivated by dimension reduction in regression analysis and signal detection, we investigate the order determination for large-dimensional matrices including spiked models of which the numbers of covariates are proportional to the sample sizes for different models. Because the asymptotic behaviour of the estimated eigenvalues of the corresponding matrices differs completely from that in fixed dimension scenarios, we then discuss the largest possible number we can identify and introduce a "valley-cliff" criterion. We propose two versions of the criterion: one based on the original differences of eigenvalues and the other based on the transformed differences, which reduces the effects of ridge selection in the former one. This generic method is very easy to implement and computationally inexpensive, and it can be applied to various matrices. As examples, we focus on spiked population models, spiked Fisher matrices and factor models with auto-covariance matrices. Numerical studies are conducted to examine the method's finite sample performance and to compare it with existing methods.
Submitted 31 October, 2019;
originally announced October 2019.
-
Multiway clustering via tensor block models
Authors:
Miaoyan Wang,
Yuchen Zeng
Abstract:
We consider the problem of identifying multiway block structure from a large noisy tensor. Such problems arise frequently in applications such as genomics, recommendation system, topic modeling, and sensor network localization. We propose a tensor block model, develop a unified least-square estimation, and obtain the theoretical accuracy guarantees for multiway clustering. The statistical convergence of the estimator is established, and we show that the associated clustering procedure achieves partition consistency. A sparse regularization is further developed for identifying important blocks with elevated means. The proposal handles a broad range of data types, including binary, continuous, and hybrid observations. Through simulation and application to two real datasets, we demonstrate the outperformance of our approach over previous methods.
Submitted 2 January, 2021; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Context Aware Machine Learning
Authors:
Yun Zeng
Abstract:
We propose a principle for exploring context in machine learning models. Starting with a simple assumption that each observation may or may not depend on its context, a conditional probability distribution is decomposed into two parts: context-free and context-sensitive. Then by employing the log-linear word production model for relating random variables to their embedding space representation and making use of the convexity of the natural exponential function, we show that the embedding of an observation can also be decomposed into a weighted sum of two vectors, representing its context-free and context-sensitive parts, respectively. This simple treatment of context provides a unified view of many existing deep learning models, leading to revisions of these models that achieve a significant performance boost. Specifically, our upgraded version of a recent sentence embedding model not only outperforms the original one by a large margin, but also leads to a new, principled approach for compositing the embeddings of bag-of-words features, as well as a new architecture for modeling attention in deep neural networks. More surprisingly, our new principle provides a novel understanding of the gates and equations defined by the long short-term memory model, which also leads to a new model that is able to converge significantly faster and achieve much lower prediction errors. Furthermore, our principle also inspires a new type of generic neural network layer that better resembles real biological neurons than the traditional linear mapping plus nonlinear activation based architecture. Its multi-layer extension provides a new principle for deep neural networks which subsumes the residual network (ResNet) as a special case, and its extension to the convolutional neural network model accounts for irrelevant input (e.g., background in an image) in addition to filtering.
Submitted 19 January, 2019; v1 submitted 10 January, 2019;
originally announced January 2019.
-
Automatic Seismic Salt Interpretation with Deep Convolutional Neural Networks
Authors:
Yu Zeng,
Kebei Jiang,
Jie Chen
Abstract:
One of the most crucial tasks in seismic reflection imaging is to identify the salt bodies with high precision. Traditionally, this is accomplished by visually picking the salt/sediment boundaries, which requires a great amount of manual work and may introduce systematic bias. With recent progress in deep learning algorithms and growing computational power, a great deal of effort has been made to replace human effort with machine power in salt body interpretation. Currently, convolutional neural networks (CNNs) are revolutionizing the computer vision field and have become a hot topic in image analysis. In this paper, the benefits of CNN-based classification are demonstrated by using a state-of-the-art network structure, U-Net, along with the residual learning framework ResNet, to delineate salt bodies with high precision. Network adjustments, including the Exponential Linear Units (ELU) activation function, the Lovász-Softmax loss function, and stratified $K$-fold cross-validation, have been deployed to further improve the prediction accuracy. The preliminary result using SEG Advanced Modeling (SEAM) data shows good agreement between the predicted salt body and the manually interpreted salt body, especially in areas with weak reflections. This indicates the great potential of applying CNNs to salt-related interpretation.
Submitted 24 November, 2018;
originally announced December 2018.
-
Application of Machine Learning in Rock Facies Classification with Physics-Motivated Feature Augmentation
Authors:
Jie Chen,
Yu Zeng
Abstract:
With recent progress in algorithms and the availability of massive amounts of computational power, the application of machine learning techniques is becoming a hot topic in the oil and gas industry. One of the most promising applications of machine learning in the upstream field is rock facies classification in reservoir characterization, which is crucial in determining the net pay thickness of reservoirs and is thus a definitive factor in the drilling decision-making process. For complex machine learning tasks like facies classification, feature engineering is often critical. This paper shows that the inclusion of physics-motivated feature interaction in feature augmentation can further improve the capability of machine learning in rock facies classification. We demonstrate this approach with the SEG 2016 machine learning contest dataset and the top winning algorithms. The improvement is robust and can be $\sim5\%$ better than the best existing F-1 score, where F-1 is an evaluation metric used to quantify average prediction accuracy.
Submitted 29 August, 2018;
originally announced August 2018.
-
Future Energy Consumption Prediction Based on Grey Forecast Model
Authors:
Yuan Zeng,
Miao Luo,
Yuzhong Liu
Abstract:
We use a grey forecast model to predict the future energy consumption of four states in the U.S., and make some improvements to the model.
Submitted 5 April, 2018;
originally announced April 2018.
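For readers unfamiliar with grey forecasting, the following is a minimal sketch of the textbook GM(1,1) model in plain Python. It is not the authors' code and does not include the improvements mentioned in the abstract; the function name `gm11_forecast` and the sample series are hypothetical.

```python
import math


def gm11_forecast(x0, steps):
    """Fit a textbook GM(1,1) model to the series x0 and forecast `steps` values.

    Assumes x0 is a list of at least three positive numbers and that the
    fitted development coefficient `a` is nonzero.
    """
    n = len(x0)

    # Accumulated generating operation (1-AGO): running sums of the series.
    x1 = []
    total = 0.0
    for value in x0:
        total += value
        x1.append(total)

    # Background values: means of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = x0[1:]

    # Ordinary least squares for the grey equation x0[k] = -a * z[k] + b.
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    b = (sy - slope * sz) / m
    a = -slope

    # Time response of the accumulated series, indexed from k = 0.
    def x1_hat(k):
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    # Undo the accumulation by differencing to get forecasts of x0.
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]


# Hypothetical series growing by 10% per period.
series = [100.0 * 1.1 ** k for k in range(6)]
print(gm11_forecast(series, 2))
```

On an exactly exponential series such as this one, the forecasts track the true continuation closely but not exactly, because GM(1,1) approximates a continuous-time model with discrete background values.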
-
Scalable Mutual Information Estimation using Dependence Graphs
Authors:
Morteza Noshad,
Yu Zeng,
Alfred O. Hero III
Abstract:
The Mutual Information (MI) is an often used measure of dependency between two random variables utilized in information theory, statistics and machine learning. Recently several MI estimators have been proposed that can achieve parametric MSE convergence rate. However, most of the previously proposed estimators have the high computational complexity of at least $O(N^2)$. We propose a unified method for empirical non-parametric estimation of general MI function between random vectors in $\mathbb{R}^d$ based on $N$ i.i.d. samples. The reduced complexity MI estimator, called the ensemble dependency graph estimator (EDGE), combines randomized locality sensitive hashing (LSH), dependency graphs, and ensemble bias-reduction methods. We prove that EDGE achieves optimal computational complexity $O(N)$, and can achieve the optimal parametric MSE rate of $O(1/N)$ if the density is $d$ times differentiable. To the best of our knowledge EDGE is the first non-parametric MI estimator that can achieve parametric MSE rates with linear time complexity. We illustrate the utility of EDGE for the analysis of the information plane (IP) in deep learning. Using EDGE we shed light on a controversy on whether or not the compression property of information bottleneck (IB) in fact holds for ReLu and other rectification functions in deep neural networks (DNN).
Submitted 23 November, 2018; v1 submitted 27 January, 2018;
originally announced January 2018.
-
A Supervised STDP-based Training Algorithm for Living Neural Networks
Authors:
Yuan Zeng,
Kevin Devincentis,
Yao Xiao,
Zubayer Ibne Ferdous,
Xiaochen Guo,
Zhiyuan Yan,
Yevgeny Berdichevsky
Abstract:
Neural networks have shown great potential in many applications like speech recognition, drug discovery, image classification, and object detection. Neural network models are inspired by biological neural networks, but they are optimized to perform machine learning tasks on digital computers. The proposed work explores the possibilities of using living neural networks in vitro as basic computational elements for machine learning applications. A new supervised STDP-based learning algorithm is proposed in this work, which considers neuron engineering constraints. An accuracy of 74.7% is achieved on the MNIST benchmark for handwritten digit recognition.
Submitted 21 March, 2018; v1 submitted 30 October, 2017;
originally announced October 2017.
-
Online Adaptive Machine Learning Based Algorithm for Implied Volatility Surface Modeling
Authors:
Yaxiong Zeng,
Diego Klabjan
Abstract:
In this work, we design a machine learning based method, online adaptive primal support vector regression (SVR), to model the implied volatility surface (IVS). The algorithm proposed is the first derivation and implementation of an online primal kernel SVR. It features enhancements that allow efficient online adaptive learning by embedding the idea of local fitness and budget maintenance to dynamically update support vectors upon pattern drifts. For algorithm acceleration, we implement its most computationally intensive parts in a Field Programmable Gate Arrays hardware, where a 132x speedup over CPU is achieved during online prediction. Using intraday tick data from the E-mini S&P 500 options market, we show that the Gaussian kernel outperforms the linear kernel in regulating the size of support vectors, and that our empirical IVS algorithm beats two competing online methods with regards to model complexity and regression errors (the mean absolute percentage error of our algorithm is up to 13%). Best results are obtained at the center of the IVS grid due to its larger number of adjacent support vectors than the edges of the grid. Sensitivity analysis is also presented to demonstrate how hyper parameters affect the error rates and model complexity.
Submitted 7 June, 2018; v1 submitted 6 June, 2017;
originally announced June 2017.
-
Hybrid safe-strong rules for efficient optimization in lasso-type problems
Authors:
Yaohui Zeng,
Tianbao Yang,
Patrick Breheny
Abstract:
The lasso model has been widely used for model selection in data mining, machine learning, and high-dimensional statistical analysis. However, with the ultrahigh-dimensional, large-scale data sets now collected in many real-world applications, it is important to develop algorithms to solve the lasso that efficiently scale up to problems of this size. Discarding features from certain steps of the algorithm is a powerful technique for increasing efficiency and addressing the Big Data challenge. In this paper, we propose a family of hybrid safe-strong rules (HSSR) which incorporate safe screening rules into the sequential strong rule (SSR) to remove unnecessary computational burden. In particular, we present two instances of HSSR, namely SSR-Dome and SSR-BEDPP, for the standard lasso problem. We further extend SSR-BEDPP to the elastic net and group lasso problems to demonstrate the generalizability of the hybrid screening idea. Extensive numerical experiments with synthetic and real data sets are conducted for both the standard lasso and the group lasso problems. Results show that our proposed hybrid rules can substantially outperform existing state-of-the-art rules.
Submitted 1 June, 2020; v1 submitted 27 April, 2017;
originally announced April 2017.
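To make the screening idea concrete, here is a minimal sketch of the basic sequential strong rule (SSR) that the hybrid rules build on, as it is commonly stated for the lasso objective 0.5*||y - Xb||^2 + lam*||b||_1. It does not implement SSR-Dome or SSR-BEDPP, and it is not taken from the authors' software; the function name `strong_rule_survivors` and the toy data are hypothetical. Because SSR is a heuristic and not a safe rule, a real solver would still check the KKT conditions on discarded features afterwards.

```python
def strong_rule_survivors(X, residual, lam_prev, lam_next):
    """Return indices of features kept by the basic sequential strong rule.

    X        : list of rows (n rows, p columns)
    residual : y - X @ beta_hat(lam_prev), as a list of length n
    Feature j is discarded when |x_j^T residual| < 2 * lam_next - lam_prev.
    """
    threshold = 2.0 * lam_next - lam_prev
    p = len(X[0])
    kept = []
    for j in range(p):
        # Inner product of column j with the residual at the previous lambda.
        corr = sum(row[j] * r for row, r in zip(X, residual))
        if abs(corr) >= threshold:
            kept.append(j)
    return kept


# Toy example: at lam_prev = max_j |x_j^T y| the solution is zero,
# so the residual equals y.
X = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
y = [3.0, 1.0]
print(strong_rule_survivors(X, y, lam_prev=3.0, lam_next=2.4))  # [0, 2]
```

Only the surviving columns need to be passed to the coordinate descent solver at the next value of lambda, which is where the computational savings come from.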
-
Optimized Cost per Click in Taobao Display Advertising
Authors:
Han Zhu,
Junqi Jin,
Chang Tan,
Fei Pan,
Yifan Zeng,
Han Li,
Kun Gai
Abstract:
Taobao, as the largest online retail platform in the world, provides billions of online display advertising impressions for millions of advertisers every day. For commercial purposes, the advertisers bid for specific spots and target crowds to compete for business traffic. The platform chooses the most suitable ads to display in tens of milliseconds. Common pricing methods include cost per mille (CPM) and cost per click (CPC). Traditional advertising systems target certain traits of users and ad placements with fixed bids, essentially regarded as coarse-grained matching of bid and traffic quality. However, the fixed bids set by the advertisers competing for different quality requests cannot fully optimize the advertisers' key requirements. Moreover, the platform has to be responsible for the business revenue and user experience. Thus, we proposed a bid optimizing strategy called optimized cost per click (OCPC) which automatically adjusts the bid to achieve finer matching of bid and traffic quality of page view (PV) request granularity. Our approach optimizes advertisers' demands, platform business revenue and user experience and as a whole improves traffic allocation efficiency. We have validated our approach in Taobao display advertising system in production. The online A/B test shows our algorithm yields substantially better results than previous fixed bid manner.
Submitted 29 January, 2019; v1 submitted 27 February, 2017;
originally announced March 2017.
-
Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent
Authors:
Fengan Li,
Lingjiao Chen,
Yijing Zeng,
Arun Kumar,
Jeffrey F. Naughton,
Jignesh M. Patel,
Xi Wu
Abstract:
Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. We fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to MGD in a way that preserves tuple boundaries within mini-batches. We then present a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that operate directly over the compressed data representation and avoid decompression overheads. An extensive empirical evaluation with real-world datasets shows that TOC consistently achieves substantial compression ratios by up to 51x and reduces runtimes for MGD workloads by up to 10.2x in popular ML systems.
Submitted 20 January, 2019; v1 submitted 22 February, 2017;
originally announced February 2017.
-
The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R
Authors:
Yaohui Zeng,
Patrick Breheny
Abstract:
Penalized regression models such as the lasso have been extensively applied to analyzing high-dimensional data sets. However, due to memory limitations, existing R packages like glmnet and ncvreg are not capable of fitting lasso-type models for ultrahigh-dimensional, multi-gigabyte data sets that are increasingly seen in many areas such as genetics, genomics, biomedical imaging, and high-frequency finance. In this research, we implement an R package called biglasso that tackles this challenge. biglasso utilizes memory-mapped files to store the massive data on the disk, only reading data into memory when necessary during model fitting, and is thus able to handle out-of-core computation seamlessly. Moreover, it's equipped with newly proposed, more efficient feature screening rules, which substantially accelerate the computation. Benchmarking experiments show that our biglasso package, as compared to existing popular ones like glmnet, is much more memory- and computation-efficient. We further analyze a 31 GB real data set on a laptop with only 16 GB RAM to demonstrate the out-of-core computation capability of biglasso in analyzing massive data sets that cannot be accommodated by existing R packages.
Submitted 11 March, 2018; v1 submitted 20 January, 2017;
originally announced January 2017.
-
Overlapping group logistic regression with applications to genetic pathway selection
Authors:
Yaohui Zeng,
Patrick Breheny
Abstract:
Discovering important genes that account for the phenotype of interest has long been challenging in genomewide expression analysis. Analyses such as Gene Set Enrichment Analysis (GSEA) that incorporate pathway information have become widespread in hypothesis testing, but pathway-based approaches have been largely absent from regression methods due to the challenges of dealing with overlapping pathways and the resulting lack of available software. The R package grpreg is widely used to fit group lasso and other group-penalized regression models; in this study, we develop an extension, grpregOverlap, to allow for overlapping group structure using the latent variable approach proposed by Jacob et al. (2009). We compare this approach to the ordinary lasso and to GSEA using both simulated and real data. We find that incorporation of prior pathway information substantially improves the accuracy of gene expression classifiers, and we shed light on several ways in which hypothesis-testing approaches such as GSEA differ from regression approaches with respect to the analysis of pathway data.
Submitted 13 September, 2016; v1 submitted 17 October, 2015;
originally announced October 2015.
-
Non-parametric Power-law Data Clustering
Authors:
Xuhui Fan,
Yiling Zeng,
Longbing Cao
Abstract:
It has always been a great challenge for clustering algorithms to automatically determine the cluster numbers according to the distribution of datasets. Several approaches have been proposed to address this issue, including the recent promising work that incorporates Bayesian Nonparametrics into the $k$-means clustering procedure. This approach shows simplicity in implementation and solidity in theory, while it also provides a feasible way to perform inference on large-scale datasets. However, several problems remain unsolved in this pioneering work, including the power-law data applicability, a mechanism to merge centers to avoid the over-fitting problem, the clustering order problem, etc. To address these issues, the Pitman-Yor Process based $k$-means (namely \emph{pyp-means}) is proposed in this paper. Taking advantage of the Pitman-Yor Process, \emph{pyp-means} treats clusters differently by dynamically and adaptively changing the threshold to guarantee the generation of power-law clustering results. Also, a center agglomeration procedure is integrated into the implementation to merge small but close clusters and then adaptively determine the cluster number. With more discussion on the clustering order, the convergence proof, complexity analysis and extension to spectral clustering, our approach is compared with traditional clustering algorithms and variational inference methods. The advantages and properties of pyp-means are validated by experiments on both synthetic datasets and real-world datasets.
Submitted 12 June, 2013;
originally announced June 2013.
-
On the Performance of Spectrum Sensing Algorithms using Multiple Antennas
Authors:
Ying-Chang Liang,
Guangming Pan,
Yonghong Zeng
Abstract:
In recent years, some spectrum sensing algorithms using multiple antennas, such as the eigenvalue based detection (EBD), have attracted a lot of attention. In this paper, we are interested in deriving the asymptotic distributions of the test statistics of the EBD algorithms. Two EBD algorithms using sample covariance matrices are considered: maximum eigenvalue detection (MED) and condition number detection (CND). The earlier studies usually assume that the number of antennas (K) and the number of samples (N) are both large, thus random matrix theory (RMT) can be used to derive the asymptotic distributions of the maximum and minimum eigenvalues of the sample covariance matrices. While assuming the number of antennas being large simplifies the derivations, in practice, the number of antennas equipped at a single secondary user is usually small, say 2 or 3, and once designed, this antenna number is fixed. Thus in this paper, our objective is to derive the asymptotic distributions of the eigenvalues and condition numbers of the sample covariance matrices for any fixed K but large N, from which the probability of detection and probability of false alarm can be obtained. The proposed methodology can also be used to analyze the performance of other EBD algorithms. Finally, computer simulations are presented to validate the accuracy of the derived results.
Submitted 18 August, 2010;
originally announced August 2010.