-
CN-SBM: Categorical Block Modelling For Primary and Residual Copy Number Variation
Authors:
Kevin Lam,
William Daniels,
J Maxwell Douglas,
Daniel Lai,
Samuel Aparicio,
Benjamin Bloem-Reddy,
Yongjin Park
Abstract:
Cancer is a genetic disorder whose clonal evolution can be monitored by tracking noisy genome-wide copy number variants. We introduce the Copy Number Stochastic Block Model (CN-SBM), a probabilistic framework that jointly clusters samples and genomic regions based on discrete copy number states using a bipartite categorical block model. Unlike models relying on Gaussian or Poisson assumptions, CN-SBM respects the discrete nature of CNV calls and captures subpopulation-specific patterns through block-wise structure. Using a two-stage approach, CN-SBM decomposes CNV data into primary and residual components, enabling detection of both large-scale chromosomal alterations and finer aberrations. We derive a scalable variational inference algorithm for application to large cohorts and high-resolution data. Benchmarks on simulated and real datasets show improved model fit over existing methods. Applied to TCGA low-grade glioma data, CN-SBM reveals clinically relevant subtypes and structured residual variation, aiding patient stratification in survival analysis. These results establish CN-SBM as an interpretable, scalable framework for CNV analysis with direct relevance for tumor heterogeneity and prognosis.
Submitted 28 June, 2025;
originally announced June 2025.
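As a rough illustration of the bipartite categorical block model described in the abstract above, the following minimal Python sketch scores a matrix of discrete copy number states under fixed sample and region cluster assignments. The function name, the argument layout, and the nested-list form of theta are assumptions made for illustration. This is not the CN-SBM implementation, and it omits the variational inference and the primary/residual decomposition.

```python
import math


def block_log_likelihood(states, sample_cluster, region_cluster, theta):
    """Log-likelihood of discrete states under a categorical block model.

    states[i][j]       -- copy number state (an integer index) for sample i, region j
    sample_cluster[i]  -- cluster index of sample i (hypothetical input)
    region_cluster[j]  -- cluster index of genomic region j (hypothetical input)
    theta[r][c][s]     -- probability of state s in block (r, c); each
                          theta[r][c] is assumed to sum to 1
    """
    total = 0.0
    for i, row in enumerate(states):
        for j, state in enumerate(row):
            # Each cell depends only on the block its sample and region fall in.
            block = theta[sample_cluster[i]][region_cluster[j]]
            total += math.log(block[state])
    return total


# Toy example: 2 samples x 2 regions, one block, two equally likely states.
toy_states = [[0, 1], [1, 0]]
toy_theta = [[[0.5, 0.5]]]
toy_ll = block_log_likelihood(toy_states, [0, 0], [0, 0], toy_theta)
```

Because every cell is modelled with a categorical distribution rather than a Gaussian or Poisson one, the states are treated as unordered labels, which is the property the abstract emphasizes.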
-
A Bayesian hierarchical model for methane emission source apportionment
Authors:
William S. Daniels,
Douglas W. Nychka,
Dorit M. Hammerling
Abstract:
Reducing methane emissions from the oil and gas sector is a key component of short-term climate action. Emission reduction efforts are often conducted at the individual site level, where the ability to apportion emissions among a finite number of potentially emitting pieces of equipment is necessary for leak detection and repair as well as for regulatory reporting of annualized emissions. We present a hierarchical Bayesian model, referred to as the multisource detection, localization, and quantification (MDLQ) model, for performing source apportionment on oil and gas sites using methane measurements from point sensor networks. The MDLQ model accounts for autocorrelation in the sensor data and enforces sparsity in the emission rate estimates via a spike-and-slab prior, as oil and gas equipment often emit intermittently. We use the MDLQ model to apportion methane emissions on an experimental oil and gas site designed to release methane in known quantities, providing a means of model evaluation. Data from this experiment are unique in their size (i.e., the number of controlled releases) and in their close approximation of emission characteristics on real oil and gas sites. As such, this study provides a baseline level of apportionment accuracy that can be expected when using point sensor networks on operational sites.
Submitted 3 June, 2025;
originally announced June 2025.
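To make the spike-and-slab idea in the abstract above concrete, here is a minimal Python sketch that draws emission rates from such a prior: each source is exactly zero with some probability (the spike) and otherwise takes a positive rate (the slab). The exponential slab, the parameter names, and the function name are assumptions chosen for illustration. The abstract does not specify the slab distribution, and this is not the MDLQ model itself.

```python
import random


def sample_spike_and_slab(n_sources, p_zero, slab_mean, rng):
    """Draw one emission rate per source from a spike-and-slab prior.

    p_zero    -- prior probability that a source is not emitting (hypothetical)
    slab_mean -- mean of the exponential slab (an assumed choice of slab)
    rng       -- a random.Random instance, passed in for reproducibility
    """
    rates = []
    for _ in range(n_sources):
        if rng.random() < p_zero:
            rates.append(0.0)  # spike: point mass at zero
        else:
            # slab: positive rate; expovariate takes the rate 1 / mean
            rates.append(rng.expovariate(1.0 / slab_mean))
    return rates


# Example: five pieces of equipment, most of them idle a priori.
example_rates = sample_spike_and_slab(5, 0.7, 2.0, random.Random(0))
```

The point mass at zero is what lets the prior express intermittent emission: a source can be switched off entirely rather than merely shrunk towards a small rate.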
-
Memory-efficient training with streaming dimensionality reduction
Authors:
Siyuan Huang,
Brian D. Hoskins,
Matthew W. Daniels,
Mark D. Stiles,
Gina C. Adam
Abstract:
The movement of large quantities of data during the training of a deep neural network presents immense challenges for machine learning workloads. To minimize this overhead, especially on the movement and calculation of gradient information, we introduce streaming batch principal component analysis as an update algorithm. Streaming batch principal component analysis uses stochastic power iterations to generate a stochastic rank-k approximation of the network gradient. We demonstrate that the low-rank updates produced by streaming batch principal component analysis can effectively train convolutional neural networks on a variety of common datasets, with performance comparable to standard mini-batch gradient descent. These results can lead to improvements both in the design of application-specific integrated circuits for deep learning and in the speed of synchronization of machine learning models trained with data parallelism.
Submitted 24 April, 2020;
originally announced April 2020.
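The abstract above builds a low-rank approximation of a gradient using power iterations. As a simplified illustration, the Python sketch below applies ordinary power iteration with deflation to a fixed matrix to obtain a rank-k approximation. It is deterministic apart from the random start and processes the whole matrix at once, so it is not the streaming, stochastic variant the paper introduces. All function and parameter names are hypothetical.

```python
import random


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def normalize(x):
    """Scale x to unit length; a zero vector is returned unchanged."""
    norm = sum(v * v for v in x) ** 0.5
    return x if norm == 0.0 else [v / norm for v in x]


def rank_k_approx(G, k, n_iter=100, seed=0):
    """Rank-k approximation of G via power iteration with deflation."""
    rng = random.Random(seed)
    m, n = len(G), len(G[0])
    resid = [list(row) for row in G]         # part of G not yet captured
    approx = [[0.0] * n for _ in range(m)]
    for _ in range(k):
        v = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_iter):
            # One power step on resid^T resid pulls v towards the
            # leading right singular vector of the residual.
            resid_t = [list(col) for col in zip(*resid)]
            v = normalize(matvec(resid_t, matvec(resid, v)))
        u = matvec(resid, v)                 # singular value times left vector
        for i in range(m):
            for j in range(n):
                approx[i][j] += u[i] * v[j]  # add the rank-1 component
                resid[i][j] -= u[i] * v[j]   # deflate it from the residual
    return approx


# A rank-1 "gradient" is recovered with k = 1.
toy_grad = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
toy_approx = rank_k_approx(toy_grad, 1)
```

Storing only the k pairs (u, v) rather than the full m-by-n matrix is what reduces the amount of gradient data that must be moved.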