Kernel Density Balancing

Authors: John Park, Ning Hao, Yue Selena Niu, Ming Hu

Abstract: High-throughput chromatin conformation capture (Hi-C) data provide insights into the 3D structure of chromosomes, with normalization being a crucial pre-processing step. A common technique for normalization is matrix balancing, which rescales rows and columns of a Hi-C matrix to equalize their sums. Despite its popularity and convenience, matrix balancing lacks statistical justification. In this paper, we introduce a statistical model to analyze matrix balancing methods and propose a kernel-based estimator that leverages spatial structure. Under mild assumptions, we demonstrate that the kernel-based method is consistent, converges faster, and is more robust to data sparsity compared to existing approaches.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2108.09431 [pdf, other]

Equivariant Variance Estimation for Multiple Change-point Model

Authors: Ning Hao, Yue Selena Niu, Han Xiao

Abstract: The variance of noise plays an important role in many change-point detection procedures and the associated inferences. Most commonly used variance estimators require strong assumptions on the true mean structure or normality of the error distribution, which may not hold in applications. More importantly, the qualities of these estimators have not been discussed systematically in the literature. In this paper, we introduce a framework of equivariant variance estimation for multiple change-point models. In particular, we characterize the set of all equivariant unbiased quadratic variance estimators for a family of change-point model classes, and develop a minimax theory for such estimators.

Submitted 15 November, 2023; v1 submitted 21 August, 2021; originally announced August 2021.

Comments: 44 pages

arXiv:2003.12540 [pdf, other]

A super scalable algorithm for short segment detection

Authors: Ning Hao, Yue Selena Niu, Feifei Xiao, Heping Zhang

Abstract: In many applications such as copy number variant (CNV) detection, the goal is to identify short segments on which the observations have different means or medians from the background. Those segments are usually short and hidden in a long sequence, and hence are very challenging to find. We study a super scalable short segment (4S) detection algorithm in this paper. This nonparametric method clusters the locations where the observations exceed a threshold for segment detection. It is computationally efficient and does not rely on a Gaussian noise assumption. Moreover, we develop a framework to assign significance levels for detected segments. We demonstrate the advantages of our proposed method by theoretical, simulation, and real data studies.

Submitted 27 March, 2020; originally announced March 2020.

Comments: To be published in Statistics in Biosciences

arXiv:1512.04093 [pdf, other]

Multiple Change-point Detection: a Selective Overview

Authors: Yue S. Niu, Ning Hao, Heping Zhang

Abstract: Very long and noisy sequence data arise in fields ranging from the biological sciences to the social sciences, including high-throughput data in genomics and stock prices in econometrics. Often such data are collected in order to identify and understand shifts in trend, e.g., from a bull market to a bear market in finance or from a normal number of chromosome copies to an excessive number of chromosome copies in genetics. Thus, identifying multiple change points in a long, possibly very long, sequence is an important problem. In this article, we review both classical and new multiple change-point detection strategies. Considering the long history and the extensive literature on change-point detection, we provide an in-depth discussion of a normal mean change-point model from the aspects of regression analysis, hypothesis testing, consistency and inference. In particular, we present a strategy to gather and aggregate local information for change-point detection that has become the cornerstone of several emerging methods because of its attractive computational and theoretical properties.

Submitted 14 July, 2016; v1 submitted 13 December, 2015; originally announced December 2015.

Comments: 26 pages, 2 figures

arXiv:1511.00282 [pdf, ps, other]

A New Reduced-Rank Linear Discriminant Analysis Method and Its Applications

Authors: Yue Selena Niu, Ning Hao, Bin Dong

Abstract: We consider multi-class classification problems for high dimensional data. Following the idea of reduced-rank linear discriminant analysis (LDA), we introduce a new dimension reduction tool with a flavor of supervised principal component analysis (PCA). The proposed method is computationally efficient and can incorporate the correlation structure among the features. Besides the theoretical insights, we show that our method is a competitive classification tool by simulated and real data examples.

Submitted 25 March, 2017; v1 submitted 1 November, 2015; originally announced November 2015.

Comments: This is the accepted version which may be slightly different from the published version

arXiv:1210.0345 [pdf, ps, other]

doi 10.1214/12-AOAS539

The screening and ranking algorithm to detect DNA copy number variations

Authors: Yue S. Niu, Heping Zhang

Abstract: DNA copy number variation (CNV) has recently gained considerable interest as a source of genetic variation that likely influences phenotypic differences. Many statistical and computational methods have been proposed and applied to detect CNVs based on data generated by genome analysis platforms. However, most algorithms are computationally intensive with complexity at least $O(n^2)$, where $n$ is the number of probes in the experiments. Moreover, the theoretical properties of those existing methods are not well understood. A faster and better characterized algorithm is desirable for ultra-high-throughput data. In this study, we propose the Screening and Ranking algorithm (SaRa), which can detect CNVs quickly and accurately with complexity down to $O(n)$. In addition, we characterize theoretical properties and present numerical analysis for our algorithm.

Submitted 1 October, 2012; originally announced October 2012.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS539 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS539

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 3, 1306-1326

Showing 1–6 of 6 results for author: Niu, Y S