Showing 1–28 of 28 results for author: Peng, R

  1. arXiv:2505.14725 [pdf, ps, other]

    q-bio.GN cs.LG stat.AP

    HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity

    Authors: Xuejun Sun, Yiran Song, Xiaochen Zhou, Ruilie Cai, Yu Zhang, Xinyi Li, Rui Peng, Jialiu Xie, Yuanyuan Yan, Muyao Tang, Prem Lakshmanane, Baiming Zou, James S. Hagood, Raymond J. Pickles, Didong Li, Fei Zou, Xiaojing Zheng

    Abstract: Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms,…

    Submitted 19 May, 2025; originally announced May 2025.

  2. arXiv:2501.04296 [pdf, other]

    stat.ME

    Inside Out: Externalizing Assumptions in Data Analysis as Validation Checks

    Authors: H. Sherry Zhang, Roger D. Peng

    Abstract: In data analysis, unexpected results often prompt researchers to revisit their procedures to identify potential issues. While some researchers may struggle to identify the root causes, experienced researchers can often quickly diagnose problems by checking a few key assumptions. These checked assumptions, or expectations, are typically informal, difficult to trace, and rarely discussed in publicat…

    Submitted 8 January, 2025; originally announced January 2025.

  3. arXiv:2412.13034 [pdf, other]

    stat.AP stat.ME

    Unified calibration and spatial mapping of fine particulate matter data from multiple low-cost air pollution sensor networks in Baltimore, Maryland

    Authors: Claire Heffernan, Kirsten Koehler, Drew R. Gentner, Roger D. Peng, Abhirup Datta

    Abstract: Low-cost air pollution sensor networks are increasingly being deployed globally, supplementing sparse regulatory monitoring with localized air quality data. In some areas, like Baltimore, Maryland, there are only a few regulatory (reference) devices but multiple low-cost networks. While there are many available methods to calibrate data from each network individually, separate calibration of each ne…

    Submitted 17 December, 2024; originally announced December 2024.

  4. arXiv:2412.05783 [pdf, other]

    cs.LG stat.ML

    Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning

    Authors: Shuguang Yu, Shuxing Fang, Ruixin Peng, Zhengling Qi, Fan Zhou, Chengchun Shi

    Abstract: This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneou…

    Submitted 7 December, 2024; originally announced December 2024.

  5. arXiv:2406.18681 [pdf, other]

    stat.ME

    Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features

    Authors: Samuel Gailliot, Rajarshi Guhaniyogi, Roger D. Peng

    Abstract: This article focuses on drawing computationally-efficient predictive inference from Gaussian process (GP) regressions with a large number of features when the response is conditionally independent of the features given the projection to a noisy low dimensional manifold. Bayesian estimation of the regression relationship using Markov Chain Monte Carlo and subsequent predictive inference is computat…

    Submitted 25 September, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: 32 Pages, 10 Figures

  6. arXiv:2312.16607 [pdf, other]

    eess.IV cs.CV stat.ML

    A Polarization and Radiomics Feature Fusion Network for the Classification of Hepatocellular Carcinoma and Intrahepatic Cholangiocarcinoma

    Authors: Jia Dong, Yao Yao, Liyan Lin, Yang Dong, Jiachen Wan, Ran Peng, Chao Li, Hui Ma

    Abstract: Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mu…

    Submitted 27 December, 2023; originally announced December 2023.

  7. arXiv:2312.07616 [pdf, other]

    stat.ME math.ST stat.AP

    Evaluating the Alignment of a Data Analysis between Analyst and Audience

    Authors: Lucy D'Agostino McGowan, Roger D. Peng, Stephanie C. Hicks

    Abstract: A challenge that data analysts face is building a data analysis that is useful for a given consumer. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce a concept that we call the alignment of a data analysis between the data analyst and a consumer. We define a succ…

    Submitted 11 December, 2023; originally announced December 2023.

  8. arXiv:2310.17506 [pdf, other]

    stat.AP

    Predicting Patient No-Shows in Community Health Clinics: A Case Study in Designing a Data Analytic Product

    Authors: Roger D. Peng

    Abstract: The data science revolution has highlighted the varying roles that data analytic products can play in different industries and applications. There has been particular interest in using analytic products coupled with algorithmic prediction models to aid in human decision-making. However, detailed descriptions of the decision-making process that leads to the design and development of analytic prod…

    Submitted 26 October, 2023; originally announced October 2023.

  9. arXiv:2309.08494 [pdf, other]

    stat.ME

    Modeling Data Analytic Iteration With Probabilistic Outcome Sets

    Authors: Roger D. Peng, Stephanie C. Hicks

    Abstract: In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected f…

    Submitted 1 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: 30 pages

  10. arXiv:2306.01337 [pdf, other]

    cs.CL stat.ML

    MathChat: Converse to Tackle Challenging Math Problems with LLM Agents

    Authors: Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang

    Abstract: Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. LLMs, with their generalized ability, are used as a foundation model to build AI agents for different tasks. In this paper, we study the effectiveness of utilizing LLM age…

    Submitted 28 June, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Update version

  11. arXiv:2203.14775 [pdf, other]

    stat.AP

    A dynamic spatial filtering approach to mitigate underestimation bias in field calibrated low-cost sensor air-pollution data

    Authors: Claire Heffernan, Roger Peng, Drew R. Gentner, Kirsten Koehler, Abhirup Datta

    Abstract: Low-cost air pollution sensors, offering hyper-local characterization of pollutant concentrations, are becoming increasingly prevalent in environmental and public health research. However, low-cost air pollution data can be noisy, biased by environmental conditions, and usually need to be field-calibrated by collocating low-cost sensors with reference-grade instruments. We show, theoretically and…

    Submitted 20 February, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

  12. arXiv:2105.06324 [pdf, other]

    stat.OT stat.AP

    Perspective on Data Science

    Authors: Roger D. Peng, Hilary S. Parker

    Abstract: The field of data science currently enjoys a broad definition that includes a wide array of activities which borrow from many other established fields of study. Having such a vague characterization of a field in the early stages might be natural, but over time maintaining such a broad definition becomes unwieldy and impedes progress. In particular, the teaching of data science is hampered by the s…

    Submitted 13 May, 2021; originally announced May 2021.

  13. arXiv:2103.05689 [pdf, other]

    stat.ME stat.AP stat.OT

    Design Principles for Data Analysis

    Authors: Lucy D'Agostino McGowan, Roger D. Peng, Stephanie C. Hicks

    Abstract: The data science revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking -- the problem-solving process to understand the people for whom a product is being designed. For a given problem, there can be significant or subtle d…

    Submitted 9 March, 2021; originally announced March 2021.

    Comments: arXiv admin note: text overlap with arXiv:1903.07639

  14. arXiv:2008.02464 [pdf, other]

    stat.ML cs.LG math.PR

    A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices

    Authors: Jiezhong Qiu, Chi Wang, Ben Liao, Richard Peng, Jie Tang

    Abstract: We prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular (aperiodic and irreducible) finite Markov chain. Specifically, consider a random walk on a regular Markov chain and a Hermitian matrix-valued function on its state space. Our result gives exponentially decreasing bounds on the tail distributions of the extreme eigenvalues of the sample mean matrix. Our pro…

    Submitted 29 October, 2020; v1 submitted 6 August, 2020; originally announced August 2020.

    Comments: Accepted at NeurIPS'20, 25 pages

  15. arXiv:2007.12210 [pdf, ps, other]

    stat.OT

    Reproducible Research: A Retrospective

    Authors: Roger D. Peng, Stephanie C. Hicks

    Abstract: Rapid advances in computing technology over the past few decades have spurred two extraordinary phenomena in science: large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis. Together, these two phenomena have brought about tremendous advances in scientific discovery but have also raised two serious concerns,…

    Submitted 23 July, 2020; originally announced July 2020.

  16. arXiv:2007.03746 [pdf, ps, other]

    eess.SP cs.HC cs.LG stat.ML

    Transfer Learning for Motor Imagery Based Brain-Computer Interfaces: A Complete Pipeline

    Authors: Dongrui Wu, Xue Jiang, Ruimin Peng, Wanzeng Kong, Jian Huang, Zhigang Zeng

    Abstract: Transfer learning (TL) has been widely used in motor imagery (MI) based brain-computer interfaces (BCIs) to reduce the calibration effort for a new subject, and demonstrated promising performance. While a closed-loop MI-based BCI system, after electroencephalogram (EEG) signal acquisition and temporal filtering, includes spatial filtering, feature engineering, and classification blocks before send…

    Submitted 22 January, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

    Journal ref: Neural Networks, 153:235-253, 2022

  17. arXiv:2007.02817 [pdf, other]

    cs.LG cs.DS stat.ML

    Faster Graph Embeddings via Coarsening

    Authors: Matthew Fahrbach, Gramoz Goranci, Richard Peng, Sushant Sachdeva, Chi Wang

    Abstract: Graph embeddings are a ubiquitous tool for machine learning tasks, such as node classification and link prediction, on graph-structured data. However, computing the embeddings for large-scale graphs is prohibitively inefficient even if we are interested only in a small subset of relevant vertices. To address this, we present an efficient graph coarsening approach, based on Schur complements, for c…

    Submitted 22 October, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: 18 pages, 2 figures, to appear in the Proceedings of the 37th International Conference on Machine Learning (ICML 2020)

    Journal ref: Proceedings of the 37th International Conference on Machine Learning (ICML 2020) 2953-2963

  18. arXiv:2003.09737 [pdf, ps, other]

    cs.LG stat.ML

    BoostTree and BoostForest for Ensemble Learning

    Authors: Changming Zhao, Dongrui Wu, Jian Huang, Ye Yuan, Hai-Tao Zhang, Ruimin Peng, Zhenhua Shi

    Abstract: Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for…

    Submitted 6 December, 2022; v1 submitted 21 March, 2020; originally announced March 2020.

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  19. arXiv:1906.01621 [pdf, ps, other]

    math.OC cs.LG stat.ML

    Higher-Order Accelerated Methods for Faster Non-Smooth Optimization

    Authors: Brian Bullins, Richard Peng

    Abstract: We provide improved convergence rates for various \emph{non-smooth} optimization problems via higher-order accelerated methods. In the case of $\ell_\infty$ regression, we achieve an $O(ε^{-4/5})$ iteration complexity, breaking the $O(ε^{-1})$ barrier so far present for previous methods. We arrive at a similar rate for the problem of $\ell_1$-SVM, going beyond what is attainable by first-order me…

    Submitted 4 June, 2019; originally announced June 2019.

  20. arXiv:1904.11907 [pdf, ps, other]

    stat.OT stat.AP

    Evaluating the Success of a Data Analysis

    Authors: Stephanie C. Hicks, Roger D. Peng

    Abstract: A fundamental problem in the practice and teaching of data science is how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between data analyses. Here, we…

    Submitted 26 April, 2019; originally announced April 2019.

    Comments: 16 pages

  21. arXiv:1903.07639 [pdf, other]

    stat.AP

    Elements and Principles for Characterizing Variation between Data Analyses

    Authors: Stephanie C. Hicks, Roger D. Peng

    Abstract: The data revolution has led to an increased interest in the practice of data analysis. For a given problem, there can be significant or subtle differences in how a data analyst constructs or creates a data analysis, including differences in the choice of methods, tooling, and workflow. In addition, data analysts can prioritize (or not) certain objective characteristics in a data analysis, leading…

    Submitted 25 July, 2019; v1 submitted 18 March, 2019; originally announced March 2019.

    Comments: 14 pages, 7 figures, 1 table

  22. arXiv:1901.06764 [pdf, ps, other]

    cs.DS math.NA math.OC stat.ML

    Iterative Refinement for $\ell_p$-norm Regression

    Authors: Deeksha Adil, Rasmus Kyng, Richard Peng, Sushant Sachdeva

    Abstract: We give improved algorithms for the $\ell_{p}$-regression problem, $\min_{x} \|x\|_{p}$ such that $A x=b,$ for all $p \in (1,2) \cup (2,\infty).$ Our algorithms obtain a high accuracy solution in $\tilde{O}_{p}(m^{\frac{|p-2|}{2p + |p-2|}}) \le \tilde{O}_{p}(m^{\frac{1}{3}})$ iterations, where each iteration requires solving an $m \times m$ linear system, $m$ being the dimension of the ambient spa…

    Submitted 20 January, 2019; originally announced January 2019.

    Comments: Published in SODA 2019. Was initially submitted to SODA on July 12, 2018

  23. arXiv:1509.08968 [pdf, other]

    stat.AP

    A glass half full interpretation of the replicability of psychological science

    Authors: Jeffrey T. Leek, Prasad Patil, Roger D. Peng

    Abstract: A recent study of the replicability of key psychological findings is a major contribution toward understanding the human side of the scientific process. Despite the careful and nuanced analysis reported in the paper, mass and social media adhered to the simple narrative that only 36% of the studies replicated their original results. Here we show that 77% of the replication effect sizes reported we…

    Submitted 29 September, 2015; originally announced September 2015.

    Comments: 6 pages, 3 figures

  24. arXiv:1502.03496 [pdf, ps, other]

    cs.DS cs.DM cs.LG cs.SI stat.ML

    Spectral Sparsification of Random-Walk Matrix Polynomials

    Authors: Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, Shang-Hua Teng

    Abstract: We consider a fundamental algorithmic question in spectral graph theory: Compute a spectral sparsifier of random-walk matrix-polynomial $$L_α(G)=D-\sum_{r=1}^dα_rD(D^{-1}A)^r$$ where $A$ is the adjacency matrix of a weighted, undirected graph, $D$ is the diagonal matrix of weighted degrees, and $α=(α_1...α_d)$ are nonnegative coefficients with $\sum_{r=1}^dα_r=1$. Recall that $D^{-1}A$ is the tran…

    Submitted 11 February, 2015; originally announced February 2015.

  25. Reproducible Research Can Still Be Wrong: Adopting a Prevention Approach

    Authors: Jeffrey T. Leek, Roger D. Peng

    Abstract: Reproducibility, the ability to recompute results, and replicability, the chances other experimenters will achieve a consistent result, are two foundational characteristics of successful scientific research. Consistent findings from independent investigators are the primary means by which scientific evidence accumulates for or against an hypothesis. And yet, of late there has been a crisis of conf…

    Submitted 10 February, 2015; originally announced February 2015.

    Comments: 3 pages, 1 figure

    Journal ref: PNAS 112 (6) 1645-1645, 2015

  26. arXiv:1410.5392 [pdf, ps, other]

    cs.DS cs.LG math.NA stat.CO stat.ML

    Scalable Parallel Factorizations of SDD Matrices and Efficient Sampling for Gaussian Graphical Models

    Authors: Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, Shang-Hua Teng

    Abstract: Motivated by a sampling problem basic to computational statistical inference, we develop a nearly optimal algorithm for a fundamental problem in spectral graph theory and numerical analysis. Given an $n\times n$ SDDM matrix $\mathbf{M}$, and a constant $-1 \leq p \leq 1$, our algorithm gives efficient access to a sparse $n\times n$ linear operator $\tilde{\mathbf{C}}$ such that…

    Submitted 20 October, 2014; originally announced October 2014.

  27. arXiv:1408.5099 [pdf, ps, other]

    cs.DS cs.LG stat.ML

    Uniform Sampling for Matrix Approximation

    Authors: Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford

    Abstract: Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time significantly. For theoretical performance guarantees, each row must be sampled with probability proportional to its statistical leverage score. Unfortunately, leverag…

    Submitted 21 August, 2014; originally announced August 2014.

  28. arXiv:1404.5358 [pdf, other]

    stat.AP

    A randomized trial in a massive online open course shows people don't know what a statistically significant relationship looks like, but they can learn

    Authors: Aaron Fisher, G. Brooke Anderson, Roger Peng, Jeff Leek

    Abstract: Scatterplots are the most common way for statisticians, scientists, and the public to visually detect relationships between measured variables. At the same time, and despite widely publicized controversy, P-values remain the most commonly used measure to statistically justify relationships identified between variables. Here we measure the ability to detect statistically significant relationships f…

    Submitted 21 April, 2014; originally announced April 2014.

    Comments: 7 pages, including 2 figures and 1 table