-
Density Prediction of Income Distribution Based on Mixed Frequency Data
Authors: Yinzhi Wang, Yingqiu Zhu, Ben-Chang Shia, Lei Qin
Abstract:
Modeling large dependent datasets in modern time series analysis is a crucial research area. One effective approach to handle such datasets is to transform the observations into density functions and apply statistical methods for further analysis. Income distribution forecasting, a common application scenario, benefits from predicting density functions as it accounts for uncertainty around point estimates, leading to more informed policy formulation. However, predictive modeling becomes challenging when dealing with mixed-frequency data. To address this challenge, this paper introduces a mixed data sampling regression model for probability density functions (PDF-MIDAS). To mitigate variance inflation caused by high-frequency prediction variables, we utilize exponential Almon polynomials with fewer parameters to regularize the coefficient structure. Additionally, we propose an iterative estimation method based on quadratic programming and the BFGS algorithm. Simulation analyses demonstrate that as the sample size for estimating density functions and observation length increase, the estimator approaches the true value. Real data analysis reveals that compared to single-sequence prediction models, PDF-MIDAS incorporating high-frequency exogenous variables offers a wider range of application scenarios with superior fitting and prediction performance.
Submitted 21 July, 2025;
originally announced July 2025.
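The abstract above mentions exponential Almon polynomials as a way to keep the number of lag coefficients small. The following is a minimal sketch of the standard two-parameter exponential Almon weighting used in MIDAS regressions, not the authors' PDF-MIDAS estimator. The function names `exp_almon_weights` and `midas_aggregate` and the example parameter values are assumptions made for illustration.

```python
import numpy as np


def exp_almon_weights(theta1, theta2, n_lags):
    """Two-parameter exponential Almon lag weights (illustrative helper).

    w_k is proportional to exp(theta1 * k + theta2 * k**2) for k = 1..n_lags,
    normalized so that the weights sum to one.
    """
    k = np.arange(1, n_lags + 1, dtype=float)
    z = theta1 * k + theta2 * k ** 2
    z -= z.max()  # shift the exponent for numerical stability
    w = np.exp(z)
    return w / w.sum()


def midas_aggregate(x_high, weights):
    """Collapse high-frequency lags (most recent first) into one regressor."""
    x_high = np.asarray(x_high, dtype=float)
    return float(np.dot(weights, x_high[: len(weights)]))


# Hypothetical example: 12 monthly lags feeding one low-frequency observation.
weights = exp_almon_weights(0.1, -0.05, 12)
monthly_values = np.linspace(1.0, 2.0, 12)  # made-up data, most recent first
aggregated = midas_aggregate(monthly_values, weights)
```

Whatever the number of lags, only `theta1` and `theta2` are estimated, which is the sense in which the weighting regularizes the coefficient structure.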
-
Interconnections of Multimorbidity-Related Clinical Outcomes: Analysis of Health Administrative Claims Data with a Dynamic Network Approach
Authors: Hao Mei, Haonan Xiao, Ben-Chang Shia, Guanzhong Qiao, Yang Li
Abstract:
Given the rising complexity and burden of multimorbidity, it is crucial to provide evidence-based support for managing multimorbidity-related clinical outcomes. This study introduces a dynamic network approach to investigate conditional and time-varying interconnections in disease-specific clinical outcomes. Our method effectively tackles the issue of zero inflation, a frequent challenge in medical data that complicates traditional modeling techniques. The theoretical foundations of the proposed approach are rigorously developed and validated through extensive simulations. Using Taiwan's health administrative claims data from 2000 to 2013, we construct 14 temporally correlated yearly networks, each featuring 125 nodes that represent different disease conditions. Key network properties, such as connectivity, modules, and temporal variation, are analyzed. To demonstrate how these networks can inform multimorbidity management, we focus on breast cancer and analyze the relevant network structures. The findings provide valuable clinical insights that enhance the current understanding of multimorbidity. The proposed methods offer promising applications in shaping treatment strategies, optimizing health resource allocation, and informing health policy development in the context of multimorbidity management.
Submitted 8 April, 2025;
originally announced April 2025.
-
Pan-disease clustering analysis of the trend of period prevalence
Authors: Sneha Jadhav, Chenjin Ma, Yefei Jiang, Ben-Chang Shia, Shuangge Ma
Abstract:
Prevalence has been carefully studied for all diseases. In the "classic" paradigm, the prevalence of different diseases has usually been studied separately. Accumulating evidence has shown that diseases can be "correlated". The joint analysis of the prevalence of multiple diseases can provide important insights beyond individual-disease analysis but has not been well conducted. In this study, we take advantage of the uniquely valuable Taiwan National Health Insurance Research Database (NHIRD) and conduct a pan-disease analysis of period prevalence trends. The goal is to identify clusters within which diseases share similar period prevalence trends. For this purpose, a novel penalization pursuit approach is developed, which has an intuitive formulation and satisfactory properties. In the data analysis, the period prevalence values are computed using records on close to 1 million subjects and 14 years of observation. For 405 diseases, 35 nontrivial clusters (with sizes larger than one) and 27 trivial clusters (with size one) are identified. The results differ significantly from those of the alternatives. A closer examination suggests that the clustering results have sound interpretations. This study is the first to conduct a pan-disease clustering analysis of prevalence trends using the NHIRD data and can have important value in multiple aspects.
Submitted 17 September, 2018;
originally announced September 2018.
-
Variable Selection with Scalable Bootstrap in Generalized Linear Model for Massive Data
Authors: Zhibing He, Yichen Qin, Ben-Chang Shia, Yang Li
Abstract:
The bootstrap is commonly used as a tool for non-parametric statistical inference to estimate meaningful parameters in variable selection models. However, for massive datasets that grow at an exponential rate, the computational cost of Bootstrap Variable Selection (BootVS) can be a critical issue. In this paper, we propose Variable Selection with Bag of Little Bootstraps (BLBVS) for general linear regression and extend it to the generalized linear model, selecting important parameters and efficiently assessing the quality of estimators by analyzing the results of multiple bootstrap sub-samples. The introduced method best suits large datasets that have parallel and distributed computing structures. To test the performance of BLBVS, we compare it with BootVS from different aspects via empirical studies. The simulation results show that our method performs excellently. A real data analysis, Risk Forecast of Credit Cards, is also presented to illustrate the computational superiority of BLBVS on large-scale datasets, and the result demonstrates the usefulness and validity of our proposed method.
Submitted 23 December, 2016; v1 submitted 6 December, 2016;
originally announced December 2016.
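The abstract above builds on the Bag of Little Bootstraps (BLB). As a minimal sketch of the generic BLB resampling scheme only, and not of the authors' BLBVS variable selection procedure, the example below estimates the standard error of a sample mean. The function name `blb_stderr_of_mean`, the default settings, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def blb_stderr_of_mean(x, n_subsets=5, gamma=0.7, n_resamples=50, seed=0):
    """Bag of Little Bootstraps estimate of the standard error of the mean.

    Each subset holds b = n**gamma distinct points. Every resample has
    nominal size n but is stored as multinomial counts over those b points,
    so no array of length n is ever materialized per resample.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = int(n ** gamma)
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.choice(x, size=b, replace=False)
        # counts has shape (n_resamples, b); each row sums to n
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_resamples)
        resample_means = counts @ subset / n
        per_subset.append(resample_means.std(ddof=1))
    # Average the per-subset quality assessments
    return float(np.mean(per_subset))


# Hypothetical usage on simulated data (not from the paper)
data = np.random.default_rng(1).normal(size=10_000)
se_hat = blb_stderr_of_mean(data)
```

Because each subset is processed independently, the outer loop can be distributed across workers, which is the property that makes the approach attractive for parallel and distributed settings.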