-
Spatiotemporal Assessment of Aircraft Noise Exposure Using Mobile Phone-Derived Population Estimates and High-Resolution Noise Measurements
Authors:
Soohwan Oh,
Hyunsoo Cho,
Jungwoo Cho
Abstract:
Aircraft noise exposure has traditionally been assessed using static residential population data and long-term average noise metrics, often overlooking the dynamic nature of human mobility and temporal variations in operational conditions. This study proposes a data-driven framework that integrates high-resolution noise measurements from airport monitoring terminals with mobile phone-derived de facto population estimates to evaluate noise exposure with fine spatio-temporal resolution. We develop hourly noise exposure profiles and quantify the number of individuals affected across regions and time windows, using both absolute counts and inequality metrics such as Gini coefficients. This enables a nuanced examination of not only who is exposed, but when and where the burden is concentrated. At our case study airport, operational runway patterns resulted in recurring spatial shifts in noise exposure. By incorporating de facto population data, we demonstrate that identical noise operations can yield unequal impacts depending on the time and location of population presence, highlighting the importance of accounting for population dynamics in exposure assessment. Our approach offers a scalable basis for designing population-sensitive noise abatement strategies, contributing to more equitable and transparent aviation noise management.
Submitted 22 April, 2025;
originally announced April 2025.
-
Estimation of System Parameters Including Repeated Cross-Sectional Data through Emulator-Informed Deep Generative Model
Authors:
Hyunwoo Cho,
Sung Woong Cho,
Hyeontae Jo,
Hyung Ju Hwang
Abstract:
Differential equations (DEs) are crucial for modeling the evolution of natural or engineered systems. Traditionally, the parameters in DEs are adjusted to fit data from system observations. However, in fields such as politics, economics, and biology, available data are often independently collected at distinct time points from different subjects (i.e., repeated cross-sectional (RCS) data). Conventional optimization techniques struggle to accurately estimate DE parameters when RCS data exhibit various heterogeneities, leading to a significant loss of information. To address this issue, we propose a new estimation method called the emulator-informed deep-generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM integrates a physics-informed neural network-based emulator that immediately generates DE solutions and a Wasserstein generative adversarial network-based parameter generator that can effectively mimic the RCS data. We evaluated EIDGM on exponential growth, logistic population models, and the Lorenz system, demonstrating its superior ability to accurately capture parameter distributions. Additionally, we applied EIDGM to an experimental dataset of Amyloid beta 40 and beta 42, successfully capturing diverse parameter distribution shapes. This shows that EIDGM can be applied to model a wide range of systems and extended to uncover the operating principles of systems based on limited data.
Submitted 27 December, 2024;
originally announced December 2024.
-
Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking
Authors:
Xingran Chen,
Zhenke Wu,
Xu Shi,
Hyunghoon Cho,
Bhramar Mukherjee
Abstract:
We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. Forty-two studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, "SynthEHRella", is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We find that method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing fidelity of the synthetic data while controlling privacy exposure, and comprehensive benchmarking of longitudinal or conditional generation methods.
Submitted 6 November, 2024;
originally announced November 2024.
-
Privacy-Preserving Dynamic Assortment Selection
Authors:
Young Hyun Cho,
Will Wei Sun
Abstract:
With the growing demand for personalized assortment recommendations, concerns over data privacy have intensified, highlighting the urgent need for effective privacy-preserving strategies. This paper presents a novel framework for privacy-preserving dynamic assortment selection using the multinomial logit (MNL) bandits model. Our approach employs a perturbed upper confidence bound method, integrating calibrated noise into user utility estimates to balance between exploration and exploitation while ensuring robust privacy protection. We rigorously prove that our policy satisfies Joint Differential Privacy (JDP), which better suits dynamic environments than traditional differential privacy, effectively mitigating inference attack risks. This analysis is built upon a novel objective perturbation technique tailored for MNL bandits, which is also of independent interest. Theoretically, we derive a near-optimal regret bound of $\tilde{O}(\sqrt{T})$ for our policy and explicitly quantify how privacy protection impacts regret. Through extensive simulations and an application to the Expedia hotel dataset, we demonstrate substantial performance enhancements over the benchmark method.
Submitted 29 October, 2024;
originally announced October 2024.
-
Formal Privacy Guarantees with Invariant Statistics
Authors:
Young Hyun Cho,
Jordan Awan
Abstract:
Motivated by the 2020 US Census products, this paper extends differential privacy (DP) to address the joint release of DP outputs and nonprivate statistics, referred to as invariant. Our framework, Semi-DP, redefines adjacency by focusing on datasets that conform to the given invariant, ensuring indistinguishability between adjacent datasets within invariant-conforming datasets. We further develop customized mechanisms that satisfy Semi-DP, including the Gaussian mechanism and the optimal $K$-norm mechanism for rank-deficient sensitivity spaces. Our framework is applied to contingency table analysis which is relevant to the 2020 US Census, illustrating how Semi-DP enables the release of private outputs given the one-way margins as the invariant. Additionally, we provide a privacy analysis of the 2020 US Decennial Census using the Semi-DP framework, revealing that the effective privacy guarantees are weaker than advertised.
Submitted 22 October, 2024;
originally announced October 2024.
-
Moving sum procedure for multiple change point detection in large factor models
Authors:
Matteo Barigozzi,
Haeran Cho,
Lorenzo Trapani
Abstract:
The paper proposes a moving sum methodology for detecting multiple change points in high-dimensional time series under a factor model, where changes are attributed to those in loadings as well as emergence or disappearance of factors. We establish the asymptotic null distribution of the proposed test for family-wise error control, and show the consistency of the procedure for multiple change point estimation. Simulation studies and an application to a large dataset of volatilities demonstrate the competitive performance of the proposed method.
Submitted 3 October, 2024;
originally announced October 2024.
-
Tail-robust factor modelling of vector and tensor time series in high dimensions
Authors:
Matteo Barigozzi,
Haeran Cho,
Hyeyoung Maeng
Abstract:
We study the problem of factor modelling vector- and tensor-valued time series in the presence of heavy tails in the data, which produce anomalous observations with non-negligible probability. For this, we propose to combine a two-step procedure for tensor data decomposition with data truncation, which is easy to implement and does not require an iterative search for a numerical solution. Departing from the light-tail assumptions often adopted in the time series factor modelling literature, we derive the consistency and asymptotic normality of the proposed estimators while assuming the existence of the $(2 + 2ε)$-th moment only for some $ε\in (0, 1)$. Our rates explicitly depend on $ε$ characterising the effect of heavy tails, and on the chosen level of truncation. We also propose a consistent criterion for determining the number of factors. Simulation studies and applications to two macroeconomic datasets demonstrate the good performance of the proposed estimators.
Submitted 26 February, 2025; v1 submitted 12 July, 2024;
originally announced July 2024.
-
Estimation and Inference for Change Points in Functional Regression Time Series
Authors:
Shivam Kumar,
Haotian Xu,
Haeran Cho,
Daren Wang
Abstract:
In this paper, we study the estimation and inference of change points under a functional linear regression model with changes in the slope function. We present a novel Functional Regression Binary Segmentation (FRBS) algorithm which is computationally efficient as well as achieving consistency in multiple change point detection. This algorithm utilizes the predictive power of piece-wise constant functional linear regression models in the reproducing kernel Hilbert space framework. We further propose a refinement step that improves the localization rate of the initial estimator output by FRBS, and derive asymptotic distributions of the refined estimators for two different regimes determined by the magnitude of a change. To facilitate the construction of confidence intervals for underlying change points based on the limiting distribution, we propose a consistent block-type long-run variance estimator. Our theoretical justifications for the proposed approach accommodate temporal dependence and heavy-tailedness in both the functional covariates and the measurement errors. Empirical effectiveness of our methodology is demonstrated through extensive simulation studies and an application to the Standard and Poor's 500 index dataset.
Submitted 8 May, 2024;
originally announced May 2024.
-
Manipulating a Continuous Instrumental Variable in an Observational Study of Premature Babies: Algorithm, Partial Identification Bounds, and Inference under Randomization and Biased Randomization Assumptions
Authors:
Zhe Chen,
Min Haeng Cho,
Bo Zhang
Abstract:
Regionalization of intensive care for premature babies refers to a triage system of mothers with high-risk pregnancies to hospitals of varied capabilities based on risks faced by infants. Due to the limited capacity of high-level hospitals, which are equipped with advanced expertise to provide critical care, understanding the effect of delivering premature babies at such hospitals on infant mortality for different subgroups of high-risk mothers could facilitate the design of an efficient perinatal regionalization system. Towards answering this question, Baiocchi et al. (2010) proposed to strengthen an excess-travel-time-based, continuous instrumental variable (IV) in an IV-based, matched-pair design by switching focus to a smaller cohort amenable to being paired with a larger separation in the IV dose. Three elements changed with the strengthened IV: the study cohort, compliance rate and latent complier subgroup. Here, we introduce a non-bipartite, template matching algorithm that embeds data into a target, pair-randomized encouragement trial which maintains fidelity to the original study cohort while strengthening the IV. We then study randomization-based and IV-dependent, biased-randomization-based inference of partial identification bounds for the sample average treatment effect (SATE) in an IV-based matched pair design, which deviates from the usual effect ratio estimand in that the SATE is agnostic to the IV and who is matched to whom, although a strengthened IV design could narrow the partial identification bounds. Based on our proposed strengthened-IV design, we found that delivering at a high-level NICU reduced preterm babies' mortality rate compared to a low-level NICU for $81,766 \times 2 = 163,532$ mothers and their preterm babies and the effect appeared to be minimal among non-black, low-risk mothers.
Submitted 27 September, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
Interval-censored linear quantile regression
Authors:
Taehwa Choi,
Seohyeon Park,
Hunyong Cho,
Sangbum Choi
Abstract:
Censored quantile regression has emerged as a prominent alternative to the classical Cox proportional hazards model or the accelerated failure time model in both theoretical and applied statistics. While quantile regression has been extensively studied for right-censored survival data, methodologies for analyzing interval-censored data remain limited in the survival analysis literature. This paper introduces a novel local weighting approach for estimating linear censored quantile regression, specifically tailored to handle diverse forms of interval-censored survival data. The estimation equation and the corresponding convex objective function for the regression parameter can be constructed as a weighted average of quantile loss contributions at two interval endpoints. The weighting components are nonparametrically estimated using local kernel smoothing or ensemble machine learning techniques. To estimate the nonparametric distribution mass for interval-censored data, a modified EM algorithm for nonparametric maximum likelihood estimation is employed by introducing subject-specific latent Poisson variables. The proposed method's empirical performance is demonstrated through extensive simulation studies and real data analyses of two HIV/AIDS datasets.
Submitted 17 April, 2024;
originally announced April 2024.
-
Detection and inference of changes in high-dimensional linear regression with non-sparse structures
Authors:
Haeran Cho,
Tobias Kley,
Housen Li
Abstract:
For data segmentation in high-dimensional linear regression settings, the regression parameters are often assumed to be sparse segment-wise, which enables many existing methods to estimate the parameters locally via $\ell_1$-regularised maximum likelihood-type estimation and then contrast them for change point detection. Contrary to this common practice, we show that the exact sparsity of neither regression parameters nor their differences, a.k.a. differential parameters, is necessary for consistency in multiple change point detection. In fact, both statistically and computationally, better efficiency is attained by a simple strategy that scans for large discrepancies in local covariance between the regressors and the response. We go a step further and propose a suite of tools for directly inferring about the differential parameters post-segmentation, which are applicable even when the regression parameters themselves are non-sparse. Theoretical investigations are conducted under general conditions permitting non-Gaussianity, temporal dependence and ultra-high dimensionality. Numerical results from simulated and macroeconomic datasets demonstrate the competitiveness and efficacy of the proposed methods. Implementation of all methods is provided in the R package inferchange on GitHub.
Submitted 13 May, 2025; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint
Authors:
Junghyun Lee,
Hanseul Cho,
Se-Young Yun,
Chulhee Yun
Abstract:
Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another. However, existing approaches to fair PCA have two main problems: theoretically, there has been no statistical foundation of fair PCA in terms of learnability; practically, limited memory prevents us from using existing approaches, as they explicitly rely on full access to the entire data. On the theoretical side, we rigorously formulate fair PCA using a new notion called probably approximately fair and optimal (PAFO) learnability. On the practical side, motivated by recent advances in streaming algorithms for addressing memory limitation, we propose a new setting called fair streaming PCA along with a memory-efficient algorithm, fair noisy power method (FNPM). We then provide its statistical guarantee in terms of PAFO-learnability, which is the first of its kind in fair PCA literature. Lastly, we verify the efficacy and memory efficiency of our algorithm on real-world datasets.
Submitted 28 October, 2023;
originally announced October 2023.
-
Nonparametric data segmentation in multivariate time series via joint characteristic functions
Authors:
Euan T. McGonigle,
Haeran Cho
Abstract:
Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points not only in the marginal distribution but also in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series.
Submitted 6 March, 2025; v1 submitted 12 May, 2023;
originally announced May 2023.
-
fnets: An R Package for Network Estimation and Forecasting via Factor-Adjusted VAR Modelling
Authors:
Dom Owens,
Haeran Cho,
Matteo Barigozzi
Abstract:
The package fnets for the R language implements the suite of methodologies proposed by Barigozzi et al. (2022) for the network estimation and forecasting of high-dimensional time series under a factor-adjusted vector autoregressive model, which permits strong spatial and temporal correlations in the data. Additionally, we provide tools for visualising the networks underlying the time series data after adjusting for the presence of factors. The package also offers data-driven methods for selecting tuning parameters including the number of factors, vector autoregressive order and thresholds for estimating the edge sets of the networks of interest in time series analysis. We demonstrate various features of fnets on simulated datasets as well as real data on electricity prices.
Submitted 4 July, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization
Authors:
Hanseul Cho,
Chulhee Yun
Abstract:
Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.
Submitted 20 February, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
High-dimensional data segmentation in regression settings permitting temporal dependence and non-Gaussianity
Authors:
Haeran Cho,
Dom Owens
Abstract:
We propose a data segmentation methodology for the high-dimensional linear regression problem where regression parameters are allowed to undergo multiple changes. The proposed methodology, MOSEG, proceeds in two stages: first, the data are scanned for multiple change points using a moving window-based procedure, which is followed by a location refinement stage. MOSEG enjoys computational efficiency thanks to the adoption of a coarse grid in the first stage, and achieves theoretical consistency in estimating both the total number and the locations of the change points, under general conditions permitting serial dependence and non-Gaussianity. We also propose MOSEG.MS, a multiscale extension of MOSEG which, while comparable to MOSEG in terms of computational complexity, achieves theoretical consistency for a broader parameter space where large parameter shifts over short intervals and small changes over long stretches of stationarity are simultaneously allowed. We demonstrate good performance of the proposed methods in comparative simulation studies and in an application to predicting the equity premium.
Submitted 31 October, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Capturing usage patterns in bike sharing system via multilayer network fused Lasso
Authors:
Yunjin Choi,
Haeran Cho,
Hyelim Son
Abstract:
Data collected from a bike-sharing system exhibit complex temporal and spatial features. We analyze shared-bike usage data collected in three large cities at the level of individual stations, accounting for station-specific behavior and covariate effects. For this, we adopt a penalized regression approach with a multilayer network fused Lasso penalty. These fusion penalties are imposed on networks which embed spatio-temporal linkages, and capture the homogeneity in bike usage that is attributed to intricate spatio-temporal features without arbitrarily partitioning the data. On the real-life datasets, we demonstrate that the proposed approach yields competitive predictive performance and provides a new interpretation of the data.
Submitted 25 August, 2024; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Moving sum procedure for change point detection under piecewise linearity
Authors:
Joonpyo Kim,
Hee-Seok Oh,
Haeran Cho
Abstract:
We propose a computationally and statistically efficient procedure for segmenting univariate data under piecewise linearity. The proposed moving sum (MOSUM) methodology detects multiple change points where the underlying signal undergoes discontinuous jumps and/or slope changes. Theoretically, it controls the family-wise error rate at a given significance level asymptotically and achieves consistency in multiple change point detection, as well as matching the minimax optimal rate of estimation when the signal is piecewise linear and continuous, all under weak assumptions permitting serial dependence and heavy-tailedness. Computationally, the complexity of the MOSUM procedure is $O(n)$ which, combined with its good performance on simulated datasets, makes it highly attractive in comparison with the existing methods. We further demonstrate its good performance on a real data example on rolling element-bearing prognostics.
△ Less
Submitted 24 August, 2023; v1 submitted 9 August, 2022;
originally announced August 2022.
-
Rethinking Efficacy of Softmax for Lightweight Non-Local Neural Networks
Authors:
Yooshin Cho,
Youngsoo Kim,
Hanbyel Cho,
Jaesung Ahn,
Hyeong Gwon Hong,
Junmo Kim
Abstract:
The non-local (NL) block is a popular module that demonstrates the capability to model global contexts. However, the NL block generally has heavy computation and memory costs, so it is impractical to apply the block to high-resolution feature maps. In this paper, to investigate the efficacy of the NL block, we empirically analyze whether the magnitude and direction of input feature vectors properly affect the attention between vectors. The results show the inefficacy of the softmax operation, which is generally used to normalize the attention map of the NL block. Attention maps normalized with the softmax operation rely heavily upon the magnitude of key vectors, and performance degrades if the magnitude information is removed. By replacing the softmax operation with a scaling factor, we demonstrate improved performance on CIFAR-10, CIFAR-100, and Tiny-ImageNet. In addition, our method shows robustness to embedding channel reduction and embedding weight initialization. Notably, our method makes multi-head attention employable without additional computational cost.
Submitted 27 July, 2022;
originally announced July 2022.
-
Robust multiscale estimation of time-average variance for time series segmentation
Authors:
Euan T. McGonigle,
Haeran Cho
Abstract:
There exist several methods developed for the canonical change point problem of detecting multiple mean shifts, which search for changes over sections of the data at multiple scales. In such methods, estimation of the noise level is often required in order to distinguish genuine changes from random fluctuations due to the noise. When serial dependence is present, using a single estimator of the noise level may not be appropriate. Instead, it is proposed to adopt a scale-dependent time-average variance constant that depends on the length of the data section in consideration, to gauge the level of the noise therein. Accordingly, an estimator that is robust to the presence of multiple mean shifts is developed. The consistency of the proposed estimator is shown under general assumptions permitting heavy-tailedness, and its use with two widely adopted data segmentation algorithms, the moving sum and the wild binary segmentation procedures, is discussed. The performance of the proposed estimator is illustrated through extensive simulation studies and on applications to the house price index and air quality data sets.
Submitted 10 October, 2022; v1 submitted 23 May, 2022;
originally announced May 2022.
-
High-dimensional time series segmentation via factor-adjusted vector autoregressive modelling
Authors:
Haeran Cho,
Hyeyoung Maeng,
Idris A. Eckley,
Paul Fearnhead
Abstract:
Vector autoregressive (VAR) models are popularly adopted for modelling high-dimensional time series, and their piecewise extensions allow for structural changes in the data. In VAR modelling, the number of parameters grows quadratically with the dimensionality, which necessitates the sparsity assumption in high dimensions. However, it is debatable whether such an assumption is adequate for handling datasets exhibiting strong serial and cross-sectional correlations. We propose a piecewise stationary time series model that simultaneously allows for strong correlations as well as structural changes, where pervasive serial and cross-sectional correlations are accounted for by a time-varying factor structure, and any remaining idiosyncratic dependence between the variables is handled by a piecewise stationary VAR model. We propose an accompanying two-stage data segmentation methodology which fully addresses the challenges arising from the latency of the component processes. Its consistency in estimating both the total number and the locations of the change points in the latent components is established under conditions considerably more general than those in the existing literature. We demonstrate the competitive performance of the proposed methodology on simulated datasets and an application to US blue chip stocks data.
Submitted 20 January, 2023; v1 submitted 6 April, 2022;
originally announced April 2022.
-
FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series
Authors:
Matteo Barigozzi,
Haeran Cho,
Dom Owens
Abstract:
We propose FNETS, a methodology for network estimation and forecasting of high-dimensional time series exhibiting strong serial- and cross-sectional correlations. We operate under a factor-adjusted vector autoregressive (VAR) model which, after accounting for pervasive co-movements of the variables by common factors, models the remaining idiosyncratic dynamic dependence between the variables as a sparse VAR process. Network estimation of FNETS consists of three steps: (i) factor-adjustment via dynamic principal component analysis, (ii) estimation of the latent VAR process via $\ell_1$-regularised Yule-Walker estimator, and (iii) estimation of partial correlation and long-run partial correlation matrices. In doing so, we learn three networks underpinning the VAR process, namely a directed network representing the Granger causal linkages between the variables, an undirected one embedding their contemporaneous relationships and finally, an undirected network that summarises both lead-lag and contemporaneous linkages. In addition, FNETS provides a suite of methods for forecasting the factor-driven and the idiosyncratic VAR processes. Under general conditions permitting tails heavier than the Gaussian one, we derive uniform consistency rates for the estimators in both network estimation and forecasting, which hold as the dimension of the panel and the sample size diverge. Simulation studies and real data application confirm the good performance of FNETS.
Submitted 4 March, 2025; v1 submitted 16 January, 2022;
originally announced January 2022.
-
Improving Generalization of Batch Whitening by Convolutional Unit Optimization
Authors:
Yooshin Cho,
Hanbyel Cho,
Youngsoo Kim,
Junmo Kim
Abstract:
Batch Whitening is a technique that accelerates and stabilizes training by transforming input features to have a zero mean (Centering) and a unit variance (Scaling), and by removing linear correlation between channels (Decorrelation). In commonly used structures, which are empirically optimized with Batch Normalization, the normalization layer appears between convolution and activation function. Subsequent Batch Whitening studies have employed the same structure without further analysis, even though Batch Whitening was analyzed on the premise that the input of a linear layer is whitened. To bridge the gap, we propose a new Convolutional Unit that is in line with the theory, and our method generally improves the performance of Batch Whitening. Moreover, we show the inefficacy of the original Convolutional Unit by investigating rank and correlation of features. As our method is employable with off-the-shelf whitening modules, we use Iterative Normalization (IterNorm), the state-of-the-art whitening module, and obtain significantly improved performance on five image classification datasets: CIFAR-10, CIFAR-100, CUB-200-2011, Stanford Dogs, and ImageNet. Notably, we verify that our method improves stability and performance of whitening when using a large learning rate, group size, and iteration number.
Submitted 2 November, 2021; v1 submitted 24 August, 2021;
originally announced August 2021.
-
Bootstrap confidence intervals for multiple change points based on moving sum procedures
Authors:
Haeran Cho,
Claudia Kirch
Abstract:
The problem of quantifying uncertainty about the locations of multiple change points by means of confidence intervals is addressed. The asymptotic distribution of the change point estimators obtained as the local maximisers of moving sum statistics is derived, where the limit distributions differ depending on whether the corresponding size of changes is local, i.e. tends to zero as the sample size increases, or fixed. A bootstrap procedure for confidence interval generation is proposed which adapts to the unknown magnitude of changes and guarantees asymptotic validity both for local and fixed changes. Simulation studies show good performance of the proposed bootstrap procedure, and some discussion of how it can be extended to serially dependent errors is provided.
Submitted 16 June, 2022; v1 submitted 24 June, 2021;
originally announced June 2021.
-
Tangent functional canonical correlation analysis for densities and shapes, with applications to multimodal imaging data
Authors:
Min Ho Cho,
Sebastian Kurtek,
Karthik Bharath
Abstract:
It is quite common for functional data arising from imaging data to assume values in infinite-dimensional manifolds. Uncovering associations between two or more such nonlinear functional data extracted from the same object across medical imaging modalities can assist development of personalized treatment strategies. We propose a method for canonical correlation analysis between paired probability densities or shapes of closed planar curves, routinely used in biomedical studies, which combines a convenient linearization and dimension reduction of the data using tangent space coordinates. Leveraging the fact that the corresponding manifolds are submanifolds of unit Hilbert spheres, we describe how finite-dimensional representations of the functional data objects can be easily computed, which then facilitates use of standard multivariate canonical correlation analysis methods. We further construct and visualize canonical variate directions directly on the space of densities or shapes. Utility of the method is demonstrated through numerical simulations and performance on a magnetic resonance imaging dataset of Glioblastoma Multiforme brain tumors.
Submitted 24 September, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
A Statistician Teaches Deep Learning
Authors:
G. Jogesh Babu,
David Banks,
Hyunsoon Cho,
David Han,
Hailin Sang,
Shouyi Wang
Abstract:
Deep learning (DL) has gained much attention and become increasingly popular in modern data science. Computer scientists led the way in developing deep learning techniques, so the ideas and perspectives can seem alien to statisticians. Nonetheless, it is important that statisticians become involved -- many of our students need this expertise for their careers. In this paper, developed as part of a program on DL held at the Statistical and Applied Mathematical Sciences Institute, we address this culture gap and provide tips on how to teach deep learning to statistics graduate students. After some background, we list ways in which DL and statistical perspectives differ, provide a recommended syllabus that evolved from teaching two iterations of a DL graduate course, offer examples of suggested homework assignments, give an annotated list of teaching resources, and discuss DL in the context of two research areas.
Submitted 3 February, 2021; v1 submitted 28 January, 2021;
originally announced February 2021.
-
Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data
Authors:
Hyeongmin Cho,
Sangkyun Lee
Abstract:
Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
Submitted 5 January, 2021;
originally announced January 2021.
-
Data segmentation algorithms: Univariate mean change and beyond
Authors:
Haeran Cho,
Claudia Kirch
Abstract:
Data segmentation a.k.a. multiple change point analysis has received considerable attention due to its importance in time series analysis and signal processing, with applications in a variety of fields including natural and social sciences, medicine, engineering and finance.
In the first part of this survey, we review the existing literature on the canonical data segmentation problem which aims at detecting and localising multiple change points in the mean of univariate time series. We provide an overview of popular methodologies on their computational complexity and theoretical properties. In particular, our theoretical discussion focuses on the separation rate relating to which change points are detectable by a given procedure, and the localisation rate quantifying the precision of corresponding change point estimators, and we distinguish between whether a homogeneous or multiscale viewpoint has been adopted in their derivation. We further highlight that the latter viewpoint provides the most general setting for investigating the optimality of data segmentation algorithms.
Arguably, the canonical segmentation problem has been the most popular framework to propose new data segmentation algorithms and study their efficiency in the last decades. In the second part of this survey, we motivate the importance of attaining an in-depth understanding of strengths and weaknesses of methodologies for the change point problem in a simpler, univariate setting, as a stepping stone for the development of methodologies for more complex problems. We illustrate this with a range of examples showcasing the connections between complex distributional changes and those in the mean. We also discuss extensions towards high-dimensional change point problems where we demonstrate that the challenges arising from high dimensionality are orthogonal to those in dealing with multiple change points.
Submitted 8 July, 2021; v1 submitted 23 December, 2020;
originally announced December 2020.
-
Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring
Authors:
Hunyong Cho,
Shannon T. Holloway,
David J. Couper,
Michael R. Kosorok
Abstract:
We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the failure time to be conditionally independent of censoring and dependent on the treatment decision times, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a certain time point. The estimator is constructed using generalized random survival forests and can have polynomial rates of convergence. Simulations and data analysis results suggest that the new estimator brings higher expected outcomes than existing methods in various settings. An R package dtrSurv is available on CRAN.
Submitted 12 May, 2022; v1 submitted 6 December, 2020;
originally announced December 2020.
-
Multiple change point detection under serial dependence: Wild contrast maximisation and gappy Schwarz algorithm
Authors:
Haeran Cho,
Piotr Fryzlewicz
Abstract:
We propose a methodology for detecting multiple change points in the mean of an otherwise stationary, autocorrelated, linear time series. It combines solution path generation based on the wild contrast maximisation principle, and an information criterion-based model selection strategy termed gappy Schwarz algorithm. The former is well-suited to separating shifts in the mean from fluctuations due to serial correlations, while the latter simultaneously estimates the dependence structure and the number of change points without performing the difficult task of estimating the level of the noise as quantified e.g. by the long-run variance. We provide a modular investigation into their theoretical properties and show that the combined methodology, named WCM.gSa, achieves consistency in estimating both the total number and the locations of the change points. The good performance of WCM.gSa is demonstrated via extensive simulation studies, and we further illustrate its usefulness by applying the methodology to London air quality data.
Submitted 12 April, 2023; v1 submitted 27 November, 2020;
originally announced November 2020.
-
Neural Bootstrapper
Authors:
Minsuk Shin,
Hyungjoo Cho,
Hyun-seok Min,
Sungbin Lim
Abstract:
Bootstrapping has been a primary tool for ensemble and uncertainty quantification in machine learning and statistics. However, due to its nature of multiple training and resampling, bootstrapping deep neural networks is computationally burdensome; hence it has difficulties in practical application to the uncertainty estimation and related tasks. To overcome this computational bottleneck, we propose a novel approach called Neural Bootstrapper (NeuBoots), which learns to generate bootstrapped neural networks through single model training. NeuBoots injects the bootstrap weights into the high-level feature layers of the backbone network and outputs the bootstrapped predictions of the target, without additional parameters and the repetitive computations from scratch. We apply NeuBoots to various machine learning tasks related to uncertainty quantification, including prediction calibrations in image classification and semantic segmentation, active learning, and detection of out-of-distribution samples. Our empirical results show that NeuBoots outperforms other bagging-based methods under a much lower computational cost without losing the validity of bootstrapping.
Submitted 13 December, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Discussion of 'Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection'
Authors:
Haeran Cho,
Claudia Kirch
Abstract:
We discuss the theoretical guarantee provided by the WBS2.SDLL proposed in Fryzlewicz (2020) and explore an alternative, MOSUM-based candidate generation method for the SDLL.
Submitted 2 July, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Survival regression with accelerated failure time model in XGBoost
Authors:
Avinash Barnwal,
Hyunsu Cho,
Toby Dylan Hocking
Abstract:
Survival regression is used to estimate the relation between time-to-event and feature variables, and is important in application domains such as medicine, marketing, risk management and sales management. Nonlinear tree based machine learning algorithms as implemented in libraries such as XGBoost, scikit-learn, LightGBM, and CatBoost are often more accurate in practice than linear models. However, existing state-of-the-art implementations of tree-based models have offered limited support for survival regression. In this work, we implement loss functions for learning accelerated failure time (AFT) models in XGBoost, to increase the support for survival modeling for different kinds of label censoring. We demonstrate with real and simulated experiments the effectiveness of AFT in XGBoost with respect to a number of baselines, in two respects: generalization performance and training speed. Furthermore, we take advantage of the support for NVIDIA GPUs in XGBoost to achieve substantial speedup over multi-core CPUs. To our knowledge, our work is the first implementation of AFT that utilizes the processing power of NVIDIA GPUs. Starting from the 1.2.0 release, the XGBoost package natively supports the AFT model. The addition of AFT in XGBoost has had significant impact in the open source community, and a few statistics packages now utilize the XGBoost AFT model.
Submitted 21 August, 2021; v1 submitted 8 June, 2020;
originally announced June 2020.
-
InteractionNet: Modeling and Explaining of Noncovalent Protein-Ligand Interactions with Noncovalent Graph Neural Network and Layer-Wise Relevance Propagation
Authors:
Hyeoncheol Cho,
Eok Kyun Lee,
Insung S. Choi
Abstract:
Expanding the scope of graph-based, deep-learning models to noncovalent protein-ligand interactions has earned increasing attention in structure-based drug design. Modeling the protein-ligand interactions with graph neural networks (GNNs) has experienced difficulties in the conversion of protein-ligand complex structures into the graph representation and left questions regarding whether the trained models properly learn the appropriate noncovalent interactions. Here, we proposed a GNN architecture, denoted as InteractionNet, which learns two separated molecular graphs, being covalent and noncovalent, through distinct convolution layers. We also analyzed the InteractionNet model with an explainability technique, i.e., layer-wise relevance propagation, for examination of the chemical relevance of the model's predictions. Separation of the covalent and noncovalent convolutional steps made it possible to evaluate the contribution of each step independently and analyze the graph-building strategy for noncovalent interactions. We applied InteractionNet to the prediction of protein-ligand binding affinity and showed that our model successfully predicted the noncovalent interactions in both performance and relevance in chemical interpretation.
Submitted 12 May, 2020;
originally announced May 2020.
-
Mel-spectrogram augmentation for sequence to sequence voice conversion
Authors:
Yeongtae Hwang,
Hyemin Cho,
Hongsun Yang,
Dong-Ok Won,
Insoo Oh,
Seong-Whan Lee
Abstract:
When training a sequence-to-sequence voice conversion model, we need to handle the shortage of speech pairs that consist of the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on training the sequence-to-sequence voice conversion (VC) model from scratch. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we proposed new policies (i.e., frequency warping, loudness and time length control) for more data variations. Moreover, to find the appropriate hyperparameters of augmentation policies without training the VC model, we proposed a hyperparameter search strategy and a new metric for reducing experimental cost, namely deformation per deteriorating ratio. We compared the effect of these Mel-spectrogram augmentation methods based on various sizes of training set and augmentation policies. In the experimental results, the time axis warping based policies (i.e., time length control and time warping) showed better performance than other policies. These results indicate that the use of Mel-spectrogram augmentation is beneficial for training the VC model.
Submitted 15 June, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Interval censored recursive forests
Authors:
Hunyong Cho,
Nicholas P. Jewell,
Michael R. Kosorok
Abstract:
We propose the interval censored recursive forests (ICRF) which is an iterative tree ensemble method for interval censored survival data. This nonparametric regression estimator makes the best use of censored information by iteratively updating the survival estimate, and can be viewed as a self-consistent estimator with convergence monitored using out-of-bag samples. Splitting rules optimized for interval censored data are developed and kernel-smoothing is applied. The ICRF displays the highest prediction accuracy among competing nonparametric methods in most of the simulations and in an applied example to avalanche data. An R package icrf is available for implementation.
Submitted 20 May, 2021; v1 submitted 20 December, 2019;
originally announced December 2019.
-
Two-stage data segmentation permitting multiscale change points, heavy tails and dependence
Authors:
Haeran Cho,
Claudia Kirch
Abstract:
The segmentation of a time series into piecewise stationary segments, a.k.a. multiple change point analysis, is an important problem both in time series analysis and signal processing. In the presence of multiscale change points with both large jumps over short intervals and small changes over long stationary intervals, multiscale methods achieve good adaptivity in their localisation but at the same time, require the removal of false positives and duplicate estimators via a model selection step. In this paper, we propose a localised application of the Schwarz information criterion which, as a generic methodology, is applicable with any multiscale candidate generating procedure fulfilling mild assumptions. We establish the theoretical consistency of the proposed localised pruning method in estimating the number and locations of multiple change points under general assumptions permitting heavy tails and dependence. Further, we show that combined with a MOSUM-based candidate generating procedure, it attains minimax optimality in terms of detection lower bound and localisation for i.i.d. sub-Gaussian errors. A careful comparison with the existing methods by means of (a) theoretical properties such as generality, optimality and algorithmic complexity, (b) performance on simulated datasets and run time, as well as (c) performance on real data applications, confirms the overall competitiveness of the proposed methodology.
Submitted 3 July, 2020; v1 submitted 28 October, 2019;
originally announced October 2019.
-
Framelet Pooling Aided Deep Learning Network : The Method to Process High Dimensional Medical Data
Authors:
Chang Min Hyun,
Kang Cheol Kim,
Hyun Cheol Cho,
Jae Kyu Choi,
Jin Keun Seo
Abstract:
Machine learning-based analysis of medical images often faces several hurdles, such as the lack of training data, the curse of dimensionality, and generalization issues. One of the main difficulties is the computational cost of dealing with the large input matrices that represent medical images. The purpose of this paper is to introduce a framelet-pooling aided deep learning method for mitigating the computational burden caused by large dimensionality. By transforming high dimensional data into low dimensional components by filter banks while preserving detailed information, the proposed method aims to reduce the complexity of the neural network and computational costs significantly during the learning process. Various experiments show that our method is comparable to the standard unreduced learning method, while reducing computational burdens by decomposing large-sized learning tasks into several small-scale learning tasks.
Submitted 25 July, 2019;
originally announced July 2019.
-
Scalable Neural Architecture Search for 3D Medical Image Segmentation
Authors:
Sungwoong Kim,
Ildoo Kim,
Sungbin Lim,
Woonhyuk Baek,
Chiheon Kim,
Hyungjoo Cho,
Boogeon Yoon,
Taesup Kim
Abstract:
In this paper, a neural architecture search (NAS) framework is proposed for 3D medical image segmentation, to automatically optimize a neural architecture from a large design space. Our NAS framework searches the structure of each layer including neural connectivities and operation types in both the encoder and decoder. Since optimizing over a large discrete architecture space is difficult due to high-resolution 3D medical images, a novel stochastic sampling algorithm based on a continuous relaxation is also proposed for scalable gradient-based optimization. On the 3D medical image segmentation tasks with a benchmark dataset, an architecture automatically designed by the proposed NAS framework outperforms the human-designed 3D U-Net, and moreover this optimized architecture is well suited to be transferred to different tasks.
Submitted 13 June, 2019;
originally announced June 2019.
-
DEEP-BO for Hyperparameter Optimization of Deep Networks
Authors:
Hyunghun Cho,
Yongjin Kim,
Eunjung Lee,
Daeyoung Choi,
Yongjae Lee,
Wonjong Rhee
Abstract:
The performance of deep neural networks (DNN) is very sensitive to the particular choice of hyper-parameters. To make matters worse, the shape of the learning curve can be significantly affected when a technique like batchnorm is used. As a result, hyperparameter optimization of deep networks can be much more challenging than that of traditional machine learning models. In this work, we start from well-known Bayesian Optimization solutions and provide enhancement strategies specifically designed for hyperparameter optimization of deep networks. The resulting algorithm is named DEEP-BO (Diversified, Early-termination-Enabled, and Parallel Bayesian Optimization). When evaluated over six DNN benchmarks, DEEP-BO easily outperforms or shows comparable performance with some of the well-known solutions including GP-Hedge, Hyperband, BOHB, Median Stopping Rule, and Learning Curve Extrapolation. The code used is made publicly available at https://github.com/snu-adsl/DEEP-BO.
Submitted 23 May, 2019;
originally announced May 2019.
-
Block-distributed Gradient Boosted Trees
Authors:
Theodore Vasiloudis,
Hyunsu Cho,
Henrik Boström
Abstract:
The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose block-distributed GBTs. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication.
Submitted 28 May, 2019; v1 submitted 23 April, 2019;
originally announced April 2019.
-
Aggregated Pairwise Classification of Statistical Shapes
Authors:
Min Ho Cho,
Sebastian Kurtek,
Steven N. MacEachern
Abstract:
The classification of shapes is of great interest in diverse areas ranging from medical imaging to computer vision and beyond. While many statistical frameworks have been developed for the classification problem, most are strongly tied to early formulations of the problem - with an object to be classified described as a vector in a relatively low-dimensional Euclidean space. Statistical shape data have two main properties that suggest a need for a novel approach: (i) shapes are inherently infinite dimensional with strong dependence among the positions of nearby points, and (ii) shape space is not Euclidean, but is fundamentally curved. To accommodate these features of the data, we work with the square-root velocity function of the curves to provide a useful formal description of the shape, pass to tangent spaces of the manifold of shapes at different projection points which effectively separate shapes for pairwise classification in the training data, and use principal components within these tangent spaces to reduce dimensionality. We illustrate the impact of the projection point and choice of subspace on the misclassification rate with a novel method of combining pairwise classifiers.
Submitted 22 January, 2019;
originally announced January 2019.
-
Three-Dimensionally Embedded Graph Convolutional Network (3DGCN) for Molecule Interpretation
Authors:
Hyeoncheol Cho,
Insung S. Choi
Abstract:
We present a three-dimensional graph convolutional network (3DGCN), which predicts molecular properties and biochemical activities, based on the 3D molecular graph. In the 3DGCN, graph convolution is unified with learning operations on the vector to handle the spatial information from molecular topology. The 3DGCN model exhibits significantly higher performance on various tasks compared with other deep-learning models, and has the ability to generalize a given conformer to targeted features regardless of its rotations in the 3D space. More significantly, our model can also distinguish the 3D rotations of a molecule and predict the target value, depending upon the rotation degree, in the protein-ligand docking problem, when trained with orientation-dependent datasets. The rotation distinguishability of 3DGCN, along with rotation equivariance, provides a key milestone in the implementation of three-dimensionality to the field of deep-learning chemistry that solves challenging biochemical problems.
Submitted 16 April, 2019; v1 submitted 24 November, 2018;
originally announced November 2018.
-
Consistent estimation of high-dimensional factor models when the factor number is over-estimated
Authors:
Matteo Barigozzi,
Haeran Cho
Abstract:
A high-dimensional $r$-factor model for an $n$-dimensional vector time series is characterised by the presence of a large eigengap (increasing with $n$) between the $r$-th and the $(r+1)$-th largest eigenvalues of the covariance matrix. Consequently, Principal Component (PC) analysis is the most popular estimation method for factor models and its consistency, when $r$ is correctly estimated, is well-established in the literature. However, popular factor number estimators often suffer from the lack of an obvious eigengap in empirical eigenvalues and tend to over-estimate $r$ due, for example, to the existence of non-pervasive factors affecting only a subset of the series. We show that the errors in the PC estimators resulting from the over-estimation of $r$ are non-negligible, which in turn lead to the violation of the conditions required for factor-based large covariance estimation. To remedy this, we propose new estimators of the factor model based on scaling the entries of the sample eigenvectors. We show both theoretically and numerically that the proposed estimators successfully control for the over-estimation error, and investigate their performance when applied to risk minimisation of a portfolio of financial time series.
Submitted 6 July, 2020; v1 submitted 1 November, 2018;
originally announced November 2018.
-
Deep Asymmetric Networks with a Set of Node-wise Variant Activation Functions
Authors:
Jinhyeok Jang,
Hyunjoong Cho,
Jaehong Kim,
Jaeyeon Lee,
Seungjoon Yang
Abstract:
This work presents deep asymmetric networks with a set of node-wise variant activation functions. The nodes' sensitivities are affected by activation function selections such that the nodes with smaller indices become increasingly more sensitive. As a result, features learned by the nodes are sorted by the node indices in the order of their importance. Asymmetric networks not only learn input features but also the importance of those features. Nodes of lesser importance in asymmetric networks can be pruned to reduce the complexity of the networks, and the pruned networks can be retrained without incurring performance losses. We validate the feature-sorting property using both shallow and deep asymmetric networks as well as deep asymmetric networks transferred from famous networks.
Submitted 17 May, 2019; v1 submitted 11 September, 2018;
originally announced September 2018.
-
Large-Margin Classification in Hyperbolic Space
Authors:
Hyunghoon Cho,
Benjamin DeMeo,
Jian Peng,
Bonnie Berger
Abstract:
Representing data in hyperbolic space can effectively capture latent hierarchical relationships. With the goal of enabling accurate classification of points in hyperbolic space while respecting their hyperbolic geometry, we introduce hyperbolic SVM, a hyperbolic formulation of support vector machine classifiers, and elucidate through new theoretical work its connection to the Euclidean counterpart. We demonstrate the performance improvement of hyperbolic SVM for multi-class prediction tasks on real-world complex networks as well as simulated datasets. Our work allows analytic pipelines that take the inherent hyperbolic geometry of the data into account in an end-to-end fashion without resorting to ill-fitting tools developed for Euclidean space.
Submitted 1 June, 2018;
originally announced June 2018.
-
Confidence intervals for the area under the receiver operating characteristic curve in the presence of ignorable missing data
Authors:
Hunyong Cho,
Gregory J. Matthews,
Ofer Harel
Abstract:
Receiver operating characteristic (ROC) curves are widely used as a measure of accuracy of diagnostic tests and can be summarized using the area under the ROC curve (AUC). Often, it is useful to construct confidence intervals for the AUC; however, since there are a number of different proposed methods to measure the variance of the AUC, there are many different resulting methods for constructing these intervals. In this manuscript, we compare different methods of constructing Wald-type confidence intervals in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation (MI) based on logistic regression (LR) gives the most robust coverage probability and the choice of CI method is less important. However, when the missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe's Wald method for constructing confidence intervals along with multiple imputation using predictive mean matching (PMM).
Submitted 16 April, 2018;
originally announced April 2018.
-
Link prediction for interdisciplinary collaboration via co-authorship network
Authors:
Haeran Cho,
Yi Yu
Abstract:
We analyse the Publication and Research (PURE) data set of University of Bristol collected between 2008 and 2013. Using the existing co-authorship network and academic information thereof, we propose a new link prediction methodology, with the specific aim of identifying potential interdisciplinary collaboration in a university-wide collaboration network.
Submitted 16 March, 2018;
originally announced March 2018.
-
High-dimensional GARCH process segmentation with an application to Value-at-Risk
Authors:
Haeran Cho,
Karolos Korkas
Abstract:
Models for financial risk often assume that underlying asset returns are stationary. However, there is strong evidence that multivariate financial time series entail changes not only in their within-series dependence structure, but also in the cross-sectional dependence among them. In particular, the stressed Value-at-Risk of a portfolio, a popularly adopted measure of market risk, cannot be gauged adequately unless such structural breaks are taken into account in its estimation. We propose a method for consistent detection of multiple change points in high-dimensional GARCH panel data set where both individual GARCH processes and their correlations are allowed to change over time. We prove its consistency in multiple change point estimation, and demonstrate its good performance through simulation studies and an application to the Value-at-Risk problem on a real dataset. Our methodology is implemented in the R package segMGarch, available from CRAN.
Submitted 2 March, 2021; v1 submitted 4 June, 2017;
originally announced June 2017.
-
Simultaneous multiple change-point and factor analysis for high-dimensional time series
Authors:
Matteo Barigozzi,
Haeran Cho,
Piotr Fryzlewicz
Abstract:
We propose the first comprehensive treatment of high-dimensional time series factor models with multiple change-points in their second-order structure. We operate under the most flexible definition of piecewise stationarity, and estimate the number and locations of change-points consistently as well as identifying whether they originate in the common or idiosyncratic components. Through the use of wavelets, we transform the problem of change-point detection in the second-order structure of a high-dimensional time series, into the (relatively easier) problem of change-point detection in the means of high-dimensional panel data. Also, our methodology circumvents the difficult issue of the accurate estimation of the true number of factors in the presence of multiple change-points by adopting a screening procedure. We further show that consistent factor analysis is achieved over each segment defined by the change-points estimated by the proposed methodology. In extensive simulation studies, we observe that factor analysis prior to change-point detection improves the detectability of change-points, and identify and describe an interesting 'spillover' effect in which substantial breaks in the idiosyncratic components get, naturally enough, identified as change-points in the common components, which prompts us to regard the corresponding change-points as also acting as a form of 'factors'. Our methodology is implemented in the R package factorcpt, available from CRAN.
Submitted 29 May, 2018; v1 submitted 20 December, 2016;
originally announced December 2016.