-
Discrimination-free Insurance Pricing with Privatized Sensitive Attributes
Authors:
Tianhe Zhang,
Suhan Liu,
Peng Shi
Abstract:
Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.
Submitted 16 April, 2025;
originally announced April 2025.
-
Functional Singular Value Decomposition
Authors:
Jianbin Tan,
Pixu Shi,
Anru R. Zhang
Abstract:
Heterogeneous functional data commonly arise in time series and longitudinal studies. To uncover the statistical structures of such data, we propose Functional Singular Value Decomposition (FSVD), a unified framework encompassing various tasks for the analysis of functional data with potential heterogeneity. We establish the mathematical foundation of FSVD by proving its existence and providing its fundamental properties. We then develop an implementation approach for noisy and irregularly observed functional data based on a novel alternating minimization scheme and provide theoretical guarantees for its convergence and estimation accuracy. The FSVD framework also introduces the concepts of intrinsic basis functions and intrinsic basis vectors, representing two fundamental structural aspects of random functions. These concepts enable FSVD to provide new and improved solutions to tasks including functional principal component analysis, factor models, functional clustering, functional linear regression, and functional completion, while effectively handling heterogeneity and irregular temporal sampling. Through extensive simulations, we demonstrate that FSVD-based methods consistently outperform existing methods across these tasks. To showcase the value of FSVD in real-world datasets, we apply it to extract temporal patterns from a COVID-19 case count dataset and perform data completion on an electronic health record dataset.
Submitted 16 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Supervised low-rank approximation of high-dimensional multivariate functional data via tensor decomposition
Authors:
Mohammad Samsul Alam,
Ana-Maria Staicu,
Pixu Shi
Abstract:
Motivated by the challenges of analyzing high-dimensional ($p \gg n$) sequencing data from longitudinal microbiome studies, where samples are collected at multiple time points from each subject, we propose supervised functional tensor singular value decomposition (SupFTSVD), a novel dimensionality reduction method that leverages auxiliary information in the dimensionality reduction of high-dimensional functional tensors. Although multivariate functional principal component analysis is a natural choice for dimensionality reduction of multivariate functional data, it becomes computationally burdensome in high-dimensional settings. Low-rank tensor decomposition is a feasible alternative and has gained popularity in recent literature, but existing methods in this realm are often incapable of simultaneously utilizing the temporal structure of the data and subject-level auxiliary information. SupFTSVD overcomes these limitations by generating low-rank representations of high-dimensional functional tensors while incorporating subject-level auxiliary information and accounting for the functional nature of the data. Moreover, SupFTSVD produces low-dimensional representations of subjects, features, and time, as well as subject-specific trajectories, providing valuable insights into the biological significance of variations within the data. In simulation studies, we demonstrate that our method achieves notable improvement in tensor approximation accuracy and loading estimation by utilizing auxiliary information. Finally, we applied SupFTSVD to two longitudinal microbiome studies where biologically meaningful patterns in the data were revealed.
Submitted 14 October, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
Estimating Conditional Average Treatment Effects with Heteroscedasticity by Model Averaging and Matching
Authors:
Pengfei Shi,
Xinyu Zhang,
Wei Zhong
Abstract:
We propose a model averaging approach, combined with a partition and matching method, to estimate conditional average treatment effects under heteroskedastic error settings. The proposed approach achieves asymptotic optimality, and both the weights and the estimator are consistent. Numerical studies show that our method has good finite-sample performance.
Submitted 15 December, 2024; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Deep Machine Learning Reconstructing Lattice Topology with Strong Thermal Fluctuations
Authors:
Xiao-Han Wang,
Pei Shi,
Bin Xi,
Jie Hu,
Shi-Ju Ran
Abstract:
Applying artificial intelligence to scientific problems (namely AI for science) is currently under hot debate. However, scientific problems differ considerably from conventional ones involving images, text, etc., and new challenges emerge from the unbalanced scientific data and the complicated effects of the physical setups. In this work, we demonstrate the validity of the deep convolutional neural network (CNN) for reconstructing the lattice topology (i.e., spin connectivities) in the presence of strong thermal fluctuations and unbalanced data. Taking the kinetic Ising model with Glauber dynamics as an example, the CNN maps the time-dependent local magnetic momenta (a single-node feature) evolved from a specific initial configuration (dubbed an evolution instance) to the probabilities of the presence of the possible couplings. Our scheme differs from previous ones, which might require knowledge of the node dynamics, the responses to perturbations, or the evaluation of statistical quantities such as correlations or transfer entropy from many evolution instances. The fine-tuning avoids the "barren plateau" caused by the strong thermal fluctuations at high temperatures. Accurate reconstructions can be made where the thermal fluctuations dominate over the correlations and, consequently, statistical methods in general fail. Meanwhile, we unveil the generalization of the CNN to instances evolved from unlearnt initial spin configurations and to those with unlearnt lattices. We raise an open question about learning with unbalanced data in the nearly "double-exponentially" large sample space.
Submitted 8 August, 2022;
originally announced August 2022.
-
Physics-informed ConvNet: Learning Physical Field from a Shallow Neural Network
Authors:
Pengpeng Shi,
Zhi Zeng,
Tianshou Liang
Abstract:
Big-data-based artificial intelligence (AI) supports profound evolution in almost all of science and technology. However, modeling and forecasting multi-physical systems remain a challenge due to unavoidable data scarcity and noise. Improving the generalization ability of neural networks by "teaching" domain knowledge and developing a new generation of models combined with the physical laws have become promising areas of machine learning research. Different from "deep" fully-connected neural networks embedded with physical information (PINN), a novel shallow framework named physics-informed convolutional network (PICN) is recommended from a CNN perspective, in which the physical field is generated by a deconvolution layer and a single convolution layer. The difference fields forming the physical operator are constructed using the pre-trained shallow convolution layer. An efficient linear interpolation network calculates the loss function involving boundary conditions and the physical constraints in irregular geometry domains. The effectiveness of the current development is illustrated through some numerical cases involving the solving (and estimation) of nonlinear physical operator equations and recovering physical information from noisy observations. Its potential advantage in approximating physical fields with multi-frequency components indicates that PICN may become an alternative neural network solver in physics-informed machine learning.
Submitted 7 February, 2022; v1 submitted 26 January, 2022;
originally announced January 2022.
-
Weak signal identification and inference in penalized likelihood models for categorical responses
Authors:
Yuexia Zhang,
Peibei Shi,
Zhongyi Zhu,
Linbo Wang,
Annie Qu
Abstract:
Penalized likelihood models are widely used to simultaneously select variables and estimate model parameters. However, the existence of weak signals can lead to inaccurate variable selection, biased parameter estimation, and invalid inference. Thus, identifying weak signals accurately and making valid inferences are crucial in penalized likelihood models. We develop a unified approach to identify weak signals and make inferences in penalized likelihood models, including the special case when the responses are categorical. To identify weak signals, we use the estimated selection probability of each covariate as a measure of the signal strength and formulate a signal identification criterion. To construct confidence intervals, we propose a two-step inference procedure. Extensive simulation studies show that the proposed procedure outperforms several existing methods. We illustrate the proposed method by applying it to the Practice Fusion diabetes data set.
Submitted 11 December, 2022; v1 submitted 17 August, 2021;
originally announced August 2021.
-
Guaranteed Functional Tensor Singular Value Decomposition
Authors:
Rungang Han,
Pixu Shi,
Anru R. Zhang
Abstract:
This paper introduces the functional tensor singular value decomposition (FTSVD), a novel dimension reduction framework for tensors with one functional mode and several tabular modes. The problem is motivated by high-order longitudinal data analysis. Our model assumes the observed data to be a random realization of an approximate CP low-rank functional tensor measured on a discrete time grid. Incorporating tensor algebra and the theory of Reproducing Kernel Hilbert Space (RKHS), we propose a novel RKHS-based constrained power iteration with spectral initialization. Our method can successfully estimate both singular vectors and functions of the low-rank structure in the observed data. With mild assumptions, we establish the non-asymptotic contractive error bounds for the proposed algorithm. The superiority of the proposed framework is demonstrated via extensive experiments on both simulated and real data.
Submitted 25 October, 2023; v1 submitted 9 August, 2021;
originally announced August 2021.
-
Regression for Copula-linked Compound Distributions with Applications in Modeling Aggregate Insurance Claims
Authors:
Peng Shi,
Zifeng Zhao
Abstract:
In actuarial research, a task of particular interest and importance is to predict the loss cost for individual risks so that informative decisions are made in various insurance operations such as underwriting, ratemaking, and capital management. The loss cost is typically viewed to follow a compound distribution where the summation of the severity variables is stopped by the frequency variable. A challenging issue in modeling such outcome is to accommodate the potential dependence between the number of claims and the size of each individual claim. In this article, we introduce a novel regression framework for compound distributions that uses a copula to accommodate the association between the frequency and the severity variables, and thus allows for arbitrary dependence between the two components. We further show that the new model is very flexible and is easily modified to account for incomplete data due to censoring or truncation. The flexibility of the proposed model is illustrated using both simulated and real data sets. In the analysis of granular claims data from property insurance, we find substantive negative relationship between the number and the size of insurance claims. In addition, we demonstrate that ignoring the frequency-severity association could lead to biased decision-making in insurance operations.
Submitted 12 October, 2019;
originally announced October 2019.
-
Traffic Flow Combination Forecasting Method Based on Improved LSTM and ARIMA
Authors:
Boyi Liu,
Xiangyan Tang,
Jieren Cheng,
Pengchao Shi
Abstract:
Traffic flow forecasting is a research hot spot in intelligent traffic system construction. Existing traffic flow prediction methods suffer from problems such as poor stability, high data requirements, or poor adaptability. In this paper, we define the traffic data time singularity ratio in the dropout module and propose a combination prediction method based on an improved long short-term memory neural network and the time series autoregressive integrated moving average model (SDLSTM-ARIMA), which is derived from the Recurrent Neural Networks (RNN) model. It compares the traffic data time singularity with the probability value in the dropout module and combines them at unequal time intervals to achieve an accurate prediction of traffic flow data. Then, we design an adaptive traffic flow embedded system that can adapt to Java, Python, and other languages and interfaces. The experimental results demonstrate that the method based on the SDLSTM-ARIMA model has higher accuracy than similar methods using only autoregressive integrated moving average or autoregressive models. Our embedded traffic prediction system, which integrates computer vision, machine learning, and cloud, has advantages such as high accuracy, high reliability, and low cost. Therefore, it has wide application prospects.
Submitted 25 June, 2019;
originally announced June 2019.
-
A new perspective from a Dirichlet model for forecasting outstanding liabilities of nonlife insurers
Authors:
Karthik Sriram,
Peng Shi
Abstract:
Forecasting the outstanding claim liabilities to set adequate reserves is critical for a nonlife insurer's solvency. Chain-Ladder and Bornhuetter-Ferguson are two prominent actuarial approaches used for this task. The selection between the two approaches is often ad hoc due to different underlying assumptions. We introduce a Dirichlet model that provides a common statistical framework for the two approaches, with some appealing properties. Depending on the type of information available, the model inference naturally leads to either Chain-Ladder or Bornhuetter-Ferguson prediction. Using claims data on Worker's compensation insurance from several US insurers, we discuss both frequentist and Bayesian inference.
Submitted 9 April, 2019;
originally announced April 2019.
-
Implementation of Frequency-Severity Association in BMS Ratemaking
Authors:
Rosy Oh,
Peng Shi,
Jae Youn Ahn
Abstract:
A Bonus-Malus System (BMS) in insurance is a premium adjustment mechanism widely used in a posteriori ratemaking process to set the premium for the next contract period based on a policyholder's claim history. The current practice in BMS implementation relies on the assumption of independence between claim frequency and severity, despite the fact that a series of recent studies report evidence of a significant frequency-severity relationship, particularly in automobile insurance. To address this discrepancy, we propose a copula-based correlated random effects model to accommodate the dependence between claim frequency and severity, and further illustrate how to incorporate such dependence into the current BMS. We derive analytical solutions to the optimal relativities under the proposed framework and provide numerical experiments based on real data analysis to assess the effect of frequency-severity dependence in BMS ratemaking.
Submitted 14 March, 2019;
originally announced March 2019.
-
Sales Demand Forecast in E-commerce using a Long Short-Term Memory Neural Network Methodology
Authors:
Kasun Bandara,
Peibei Shi,
Christoph Bergmeir,
Hansika Hewamalage,
Quoc Tran,
Brian Seaman
Abstract:
Generating accurate and reliable sales forecasts is crucial in the E-commerce business. The current state-of-the-art techniques are typically univariate methods, which produce forecasts considering only the historical sales data of a single product. However, in a situation where large quantities of related time series are available, conditioning the forecast of an individual time series on past behaviour of similar, related time series can be beneficial. Since the product assortment hierarchy in an E-commerce platform contains large numbers of related products, in which the sales demand patterns can be correlated, our attempt is to incorporate this cross-series information in a unified model. We achieve this by globally training a Long Short-Term Memory network (LSTM) that exploits the non-linear demand relationships available in an E-commerce product assortment hierarchy. Aside from the forecasting framework, we also propose a systematic pre-processing framework to overcome the challenges in the E-commerce business. We also introduce several product grouping strategies to supplement the LSTM learning schemes, in situations where sales patterns in a product portfolio are disparate. We empirically evaluate the proposed forecasting framework on a real-world online marketplace dataset from Walmart.com. Our method achieves competitive results on category level and super-departmental level datasets, outperforming state-of-the-art techniques.
Submitted 11 August, 2019; v1 submitted 13 January, 2019;
originally announced January 2019.
-
High-dimensional Log-Error-in-Variable Regression with Applications to Microbial Compositional Data Analysis
Authors:
Pixu Shi,
Yuchen Zhou,
Anru R. Zhang
Abstract:
In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method provides both corrections on sequencing data with possible overdispersion and simultaneously avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
Submitted 10 March, 2021; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Enhanced Pricing and Management of Bundled Insurance Risks with Dependence-aware Prediction using Pair Copula Construction
Authors:
Peng Shi,
Zifeng Zhao
Abstract:
We propose a dependence-aware predictive modeling framework for multivariate risks stemming from an insurance contract with bundling features - an important type of policy increasingly offered by major insurance companies. The bundling feature naturally leads to longitudinal measurements of multiple insurance risks, and correct pricing and management of such risks is of fundamental interest to the financial stability of the macroeconomy. We build a novel predictive model that fully captures the dependence among the multivariate repeated risk measurements. Specifically, the longitudinal measurement of each individual risk is first modeled using pair copula construction with a D-vine structure, and the multiple D-vines are then integrated by a flexible copula. The proposed model provides a unified modeling framework for multivariate longitudinal data that can accommodate different scales of measurements, including continuous, discrete, and mixed observations, and thus can be potentially useful for various economic studies. A computationally efficient sequential method is proposed for model estimation and inference, and its performance is investigated both theoretically and via simulation studies. In the application, we examine multivariate bundled risks in multi-peril property insurance using proprietary data from a commercial property insurance provider. The proposed model is found to provide improved decision making for several key insurance operations. For underwriting, we show that the experience rate priced by the proposed model leads to a 9% lift in the insurer's net revenue. For reinsurance, we show that the insurer underestimates the risk of the retained insurance portfolio by 10% when ignoring the dependence among bundled insurance risks.
Submitted 15 October, 2023; v1 submitted 18 May, 2018;
originally announced May 2018.
-
Modeling Multivariate Time Series with Copula-linked Univariate D-vines
Authors:
Zifeng Zhao,
Peng Shi,
Zhengjun Zhang
Abstract:
This paper proposes a novel multivariate time series model named Copula-linked univariate D-vines (CuDvine), which enables the simultaneous copula-based modeling of both temporal and cross-sectional dependence for multivariate time series. To construct CuDvine, we first build a semiparametric univariate D-vine time series model (uDvine) based on a D-vine. The uDvine generalizes the existing first-order copula-based Markov chain models to Markov chains of an arbitrary-order. Building upon uDvine, we construct CuDvine by linking multiple uDvines via a parametric copula. As a simple and tractable model, CuDvine provides flexible models for marginal behavior and temporal dependence of time series, and can also incorporate sophisticated cross-sectional dependence such as time-varying and spatio-temporal dependence for high-dimensional applications. Robust and computationally efficient procedures, including a sequential model selection method and a two-stage MLE, are proposed for model estimation and inference, and their statistical properties are investigated. Numerical experiments are conducted to demonstrate the flexibility of CuDvine, and to examine the performance of the sequential model selection procedure and the two-stage MLE. Real data applications on the Australian electricity price data demonstrate the superior performance of CuDvine to traditional multivariate time series models.
Submitted 30 November, 2020; v1 submitted 8 May, 2018;
originally announced May 2018.
-
Generalized Linear Models with Linear Constraints for Microbiome Compositional Data
Authors:
Jiarui Lu,
Pixu Shi,
Hongzhe Li
Abstract:
Motivated by regression analysis for microbiome compositional data, this paper considers generalized linear regression analysis with compositional covariates, where a group of linear constraints on regression coefficients are imposed to account for the compositional nature of the data and to achieve subcompositional coherence. A penalized likelihood estimation procedure using a generalized accelerated proximal gradient method is developed to efficiently estimate the regression coefficients. A de-biased procedure is developed to obtain asymptotically unbiased and normally distributed estimates, which leads to valid confidence intervals of the regression coefficients. Simulation results show the correctness of the coverage probability of the confidence intervals and smaller variances of the estimates when the appropriate linear constraints are imposed. The methods are illustrated by a microbiome study in order to identify bacterial species that are associated with inflammatory bowel disease (IBD) and to predict IBD using fecal microbiome.
Submitted 9 January, 2018;
originally announced January 2018.
-
A Model for Paired-Multinomial Data and Its Application to Analysis of Data on a Taxonomic Tree
Authors:
Pixu Shi,
Hongzhe Li
Abstract:
In human microbiome studies, sequencing read data are often summarized as counts of bacterial taxa at various taxonomic levels specified by a taxonomic tree. This paper considers the problem of analyzing two repeated measurements of microbiome data from the same subjects. Such data are often collected to assess the change of microbial composition after a certain treatment, or the difference in microbial compositions across body sites. Existing models for such count data are limited in modeling the covariance structure of the counts and in handling paired multinomial count data. A new probability distribution is proposed for paired-multinomial count data, which allows flexible covariance structure and can be used to model repeatedly measured multivariate count data. Based on this distribution, a test statistic is developed for testing the difference in compositions based on paired multinomial count data. The proposed test can be applied to the count data observed on a taxonomic tree in order to test for differences in microbiome compositions and to identify the subtrees with different subcompositions. Simulation results indicate that the proposed test has correct type 1 error rates and increased power compared to some commonly used methods. An analysis of an upper respiratory tract microbiome data set is used to illustrate the proposed methods.
Submitted 15 February, 2017;
originally announced February 2017.
-
Weak Signal Identification and Inference in Penalized Model Selection
Authors:
Peibei Shi,
Annie Qu
Abstract:
Weak signal identification and inference are very important in the area of penalized model selection, yet they are under-developed and not well-studied. Existing inference procedures for penalized estimators are mainly focused on strong signals. In this paper, we propose an identification procedure for weak signals in finite samples, and provide a transition phase in-between noise and strong signal strengths. We also introduce a new two-step inferential method to construct better confidence intervals for the identified weak signals. Our theory development assumes that variables are orthogonally designed. Both theory and numerical studies indicate that the proposed method leads to better confidence coverage for weak signals, compared with those using asymptotic inference. In addition, the proposed method outperforms the perturbation and bootstrap resampling approaches. We illustrate our method for HIV antiretroviral drug susceptibility data to identify genetic mutations associated with HIV drug resistance.
Submitted 14 November, 2016;
originally announced November 2016.
-
Regression Analysis for Microbiome Compositional Data
Authors:
Pixu Shi,
Anru Zhang,
Hongzhe Li
Abstract:
One important problem in microbiome analysis is to identify the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa at different taxonomic levels. This paper considers regression analysis with such compositional data as covariates. In order to satisfy the subcompositional coherence of the results, linear models with a set of linear constraints on the regression coefficients are introduced. Such models allow regression analysis for subcompositions and include the log-contrast model for compositional covariates as a special case. A penalized estimation procedure for estimating the regression coefficients and for selecting variables under the linear constraints is developed. A method is also proposed to obtain de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals of the regression coefficients and can be used to obtain the $p$-values. Simulation results show the validity of the confidence intervals and smaller variances of the de-biased estimates when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.
Submitted 3 March, 2016;
originally announced March 2016.
-
Demand Modeling, Forecasting, and Counterfactuals, Part I
Authors:
Parag A. Pathak,
Peng Shi
Abstract:
There are relatively few systematic comparisons of the ex ante counterfactual predictions from structural models to what occurs ex post. This paper uses a large-scale policy change in Boston in 2014 to investigate the performance of discrete choice models of demand compared to simpler alternatives. In 2013, Boston Public Schools (BPS) proposed alternative zone configurations in their school choice plan, each of which alters the set of schools participants are allowed to rank. Pathak and Shi (2013) estimated discrete choice models of demand using families' historical choices and these demand models were used to forecast the outcomes under alternative plans. BPS, the school committee, and the public used these forecasts to compare alternatives and eventually adopt a new plan for Spring 2014. This paper updates the forecasts using the most recently available historical data on participants' submitted preferences and also makes forecasts based on an alternative statistical model not based on a random utility foundation. We describe our analysis plan, the methodology, and the target forecast outcomes. Our ex ante forecasts eliminate any scope for post-analysis bias because they are made before new preferences are submitted. Part II will use newly submitted preference data to evaluate these forecasts and assess the strengths and limitations of discrete choice models of demand in our context.
Submitted 14 January, 2015; v1 submitted 28 January, 2014;
originally announced January 2014.
-
Methods to Calculate the Upper Bound of Gini Coefficient Based on Grouped Data and the Result for China
Authors:
Pixu Shi,
Anru R. Zhang
Abstract:
Determining an upper bound, particularly the optimal upper bound of the Gini coefficient when dealing with grouped data without specified income brackets, remains an important and open question. In this paper, we introduce an efficient algorithm to calculate the exact optimal upper bound of the Gini coefficient with provable guarantees. To exemplify these methods, we also offer computed results for the Gini coefficients of urban and rural China spanning the years 2003 to 2008.
Submitted 14 January, 2025; v1 submitted 21 May, 2013;
originally announced May 2013.