Showing 1–50 of 116 results for author: Yu, B

  1. arXiv:2505.08784  [pdf, ps, other]

    stat.ML cs.LG math.ST stat.ME

    PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

    Authors: Abhineet Agarwal, Michael Xiao, Rebecca Barter, Omer Ronen, Boyu Fan, Bin Yu

    Abstract: As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model s…

    Submitted 13 May, 2025; originally announced May 2025.

  2. arXiv:2503.22945  [pdf]

    stat.OT

    Statistics at a Crossroads; Who is for the Challenge?

    Authors: Xuming He, David Madigan, Bin Yu, Jon Wellner

    Abstract: This project was sponsored by the National Science Foundation and organized by a steering committee and a group of theme leaders. The six-member steering committee, consisting of James Berger, Xuming He, David Madigan, Susan Murphy, Bin Yu, and Jon Wellner, was responsible for the overall planning of the project. This report is designed to be accessible to the wider audience of key stakeholders…

    Submitted 28 March, 2025; originally announced March 2025.

  3. arXiv:2502.13283  [pdf, other]

    cs.LG stat.ML

    Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

    Authors: Jingfeng Wu, Peter Bartlett, Matus Telgarsky, Bin Yu

    Abstract: In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk v…

    Submitted 18 February, 2025; originally announced February 2025.

  4. arXiv:2412.16773  [pdf, other]

    stat.ML cs.LG eess.SP q-bio.NC

    Fast Multi-Group Gaussian Process Factor Models

    Authors: Evren Gokcen, Anna I. Jasper, Adam Kohn, Christian K. Machens, Byron M. Yu

    Abstract: Gaussian processes are now commonly used in dimensionality reduction approaches tailored to neuroscience, especially to describe changes in high-dimensional neural activity over time. As recording capabilities expand to include neuronal populations across multiple brain areas, cortical layers, and cell types, interest in extending Gaussian process factor models to characterize multi-population int…

    Submitted 21 December, 2024; originally announced December 2024.

  5. arXiv:2411.11504  [pdf, other]

    cs.AI cs.CL stat.ML

    Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

    Authors: Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin

    Abstract: The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical app…

    Submitted 18 November, 2024; originally announced November 2024.

  6. arXiv:2410.10246  [pdf, other]

    cs.CE stat.AP

    A Geometric Model with Stochastic Error for Abnormal Motion Detection of Portal Crane Bucket Grab

    Authors: Baichen Yu, Xiao Wang, Hansheng Wang

    Abstract: Abnormal swing angle detection of bucket grabs is crucial for efficient harbor operations. In this study, we develop a practically convenient swing angle detection method for crane operation, requiring only a single standard surveillance camera at the fly-jib head, without the need for sophisticated sensors or markers on the payload. Specifically, our algorithm takes the video images from the came…

    Submitted 14 October, 2024; originally announced October 2024.

  7. arXiv:2409.10580  [pdf, other]

    cs.LG cs.AI stat.ML

    Veridical Data Science for Medical Foundation Models

    Authors: Ahmed Alaa, Bin Yu

    Abstract: The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and question…

    Submitted 15 September, 2024; originally announced September 2024.

  8. arXiv:2406.19958  [pdf, other]

    stat.ML cs.LG math.ST

    The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

    Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In th…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 65C40

  9. arXiv:2406.09657  [pdf, other]

    cs.LG stat.ML

    Mitigating over-exploration in latent space optimization using LES

    Authors: Omer Ronen, Ahmed Imtiaz Humayun, Richard Baraniuk, Randall Balestriero, Bin Yu

    Abstract: We develop Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES…

    Submitted 21 February, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

  10. arXiv:2406.08447  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    The Impact of Initialization on LoRA Finetuning Dynamics

    Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

    Abstract: In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes fine…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: TL;DR: Different initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

  11. arXiv:2406.01252  [pdf, other]

    cs.CL cs.AI stat.ML

    Towards Scalable Automated Alignment of LLMs: A Survey

    Authors: Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu

    Abstract: Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approach…

    Submitted 3 September, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Paper List: https://github.com/cascip/awesome-auto-alignment

  12. arXiv:2404.00522  [pdf, other]

    cs.LG stat.ML

    Minimum-Norm Interpolation Under Covariate Shift

    Authors: Neil Mallinar, Austin Zane, Spencer Frei, Bin Yu

    Abstract: Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identi…

    Submitted 17 July, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: The Forty-first International Conference on Machine Learning (ICML 2024)

  13. arXiv:2403.08971  [pdf, other]

    stat.CO

    Designing a Data Science simulation with MERITS: A Primer

    Authors: Corrine F Elliott, James PC Duncan, Tiffany M Tang, Merle Behr, Karl Kumbier, Bin Yu

    Abstract: Simulations play a crucial role in the modern scientific process. Yet despite (or due to) this ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a simulation study should sa…

    Submitted 15 May, 2025; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: 31 pages (main text); 1 figure; 2 tables; James PC Duncan, Tiffany M Tang: Authors contributed equally to this manuscript; Merle Behr, Karl Kumbier: Authors contributed equally to this manuscript

  14. arXiv:2402.15926  [pdf, other]

    cs.LG stat.ML

    Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

    Authors: Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

    Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps.…

    Submitted 9 June, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

    Comments: COLT 2024 camera ready

  15. arXiv:2402.12354  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    LoRA+: Efficient Low Rank Adaptation of Large Models

    Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

    Abstract: In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does…

    Submitted 4 July, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 27 pages

  16. arXiv:2310.02533  [pdf, other]

    cs.LG stat.ML

    Quantifying and mitigating the impact of label errors on model disparity metrics

    Authors: Julius Adebayo, Melissa Hall, Bowen Yu, Bobbie Chern

    Abstract: Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in b…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: Conference paper at ICLR 2023

  17. arXiv:2309.10301  [pdf, other]

    stat.ML cs.LG

    Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

    Authors: Keru Wu, Yuansi Chen, Wooseok Ha, Bin Yu

    Abstract: Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify…

    Submitted 8 July, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

  18. arXiv:2308.16878  [pdf, other]

    stat.AP physics.app-ph

    On the Role of Non-Localities in Fundamental Diagram Estimation

    Authors: Jing Liu, Fangfang Zheng, Boxi Yu, Saif Jabari

    Abstract: We consider the role of non-localities in speed-density data used to fit fundamental diagrams from vehicle trajectories. We demonstrate that the use of anticipated densities results in a clear classification of speed-density data into stationary and non-stationary points, namely, acceleration and deceleration regimes and their separating boundary. The separating boundary represents a locus of stat…

    Submitted 31 August, 2023; originally announced August 2023.

  19. arXiv:2308.03215  [pdf, other]

    stat.ML cs.LG

    The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

    Authors: Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu

    Abstract: In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size…

    Submitted 6 August, 2023; originally announced August 2023.

  20. arXiv:2307.01932  [pdf, other]

    stat.ME cs.AI cs.LG stat.ML

    MDI+: A Flexible Random Forest-Based Feature Importance Framework

    Authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

    Abstract: Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Speci…

    Submitted 4 July, 2023; originally announced July 2023.

  21. arXiv:2307.00190  [pdf]

    stat.AP

    Estimands in Real-World Evidence Studies

    Authors: Jie Chen, Daniel Scharfstein, Hongwei Wang, Binbing Yu, Yang Song, Weili He, John Scott, Xiwu Lin, Hana Lee

    Abstract: A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which ref…

    Submitted 30 June, 2023; originally announced July 2023.

  22. arXiv:2210.09352  [pdf, other]

    stat.ML cs.AI cs.LG math.ST

    A Mixing Time Lower Bound for a Simplified Version of BART

    Authors: Omer Ronen, Theo Saarinen, Yan Shuo Tan, James Duncan, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, bios…

    Submitted 17 October, 2022; originally announced October 2022.

  23. arXiv:2207.14481  [pdf, other]

    econ.EM stat.ME

    Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data

    Authors: Dennis Shen, Peng Ding, Jasjeet Sekhon, Bin Yu

    Abstract: A central goal in social science is to evaluate the causal effect of a policy. One dominant approach is through panel data analysis in which the behaviors of multiple units are observed over time. The information across time and space motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthe…

    Submitted 8 October, 2022; v1 submitted 29 July, 2022; originally announced July 2022.

  24. arXiv:2205.15135  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data

    Authors: Keyan Nasseri, Chandan Singh, James Duncan, Aaron Kornblith, Bin Yu

    Abstract: Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of in…

    Submitted 30 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2201.11931

  25. arXiv:2202.00858  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

    Authors: Abhineet Agarwal, Yan Shuo Tan, Omer Ronen, Chandan Singh, Bin Yu

    Abstract: Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking th…

    Submitted 1 February, 2022; originally announced February 2022.

  26. arXiv:2201.11931  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Fast Interpretable Greedy-Tree Sums

    Authors: Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu

    Abstract: Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FI…

    Submitted 8 July, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

  27. arXiv:2111.10734  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Deep Probability Estimation

    Authors: Sheng Liu, Aakash Kaku, Weicheng Zhu, Matan Leibovich, Sreyas Mohan, Boyang Yu, Haoxiang Huang, Laure Zanna, Narges Razavian, Jonathan Niles-Weed, Carlos Fernandez-Granda

    Abstract: Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous t…

    Submitted 11 October, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: SL, AK, WZ, ML, SM contributed equally to this work; 36 pages, 17 figures, 12 tables

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:13746-13781, 2022

  28. arXiv:2111.07167  [pdf, other]

    stat.ML cs.LG math.ST

    The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

    Authors: Nikhil Ghosh, Song Mei, Bin Yu

    Abstract: To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares…

    Submitted 13 November, 2021; originally announced November 2021.

  29. arXiv:2110.09626  [pdf, other]

    stat.ML cs.IT cs.LG

    A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds

    Authors: Yan Shuo Tan, Abhineet Agarwal, Bin Yu

    Abstract: Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We ta…

    Submitted 18 October, 2021; originally announced October 2021.

  30. arXiv:2110.08634  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Towards Robust Waveform-Based Acoustic Models

    Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

    Abstract: We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, wh…

    Submitted 29 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  31. arXiv:2108.08445  [pdf, ps, other]

    stat.AP

    Seven Principles for Rapid-Response Data Science: Lessons Learned from Covid-19 Forecasting

    Authors: Bin Yu, Chandan Singh

    Abstract: In this article, we take a step back to distill seven principles out of our experience in the spring of 2020, when our 12-person rapid-response team used skills of data science and beyond to help distribute Covid PPE. This process included tapping into domain knowledge of epidemiology and medical logistics chains, curating a relevant data repository, developing models for short-term county-level d…

    Submitted 29 March, 2022; v1 submitted 18 August, 2021; originally announced August 2021.

    Comments: 4 pages, accepted in special issue of "Statistical Science" on COVID-19 Response

  32. arXiv:2108.06847  [pdf, other]

    stat.ML cs.LG

    Interpreting and improving deep-learning models with reality checks

    Authors: Chandan Singh, Wooseok Ha, Bin Yu

    Abstract: Recent deep-learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This chapter covers recent work aiming to interpret models by attributing importance to features and feature groups for a single prediction. Importantly, the proposed attributions assign importance to interactions between features, in a…

    Submitted 18 August, 2021; v1 submitted 15 August, 2021; originally announced August 2021.

  33. arXiv:2108.02422  [pdf]

    stat.AP

    Divergent Effects of Factors on Crashes under Autonomous and Conventional Driving Modes Using A Hierarchical Bayesian Approach

    Authors: Weixi Ren, Bo Yu, Yuren Chen, Kun Gao, Shan Bao

    Abstract: Influencing factors on crashes involved with autonomous vehicles (AVs) have been paid increasing attention. However, there is a lack of comparative analyses between influencing factors on crashes of AVs and human-driven vehicles. To fill this research gap, the study aims to explore the divergent effects of factors on crashes under autonomous and conventional driving modes. This study obtained 154…

    Submitted 7 April, 2022; v1 submitted 5 August, 2021; originally announced August 2021.

    Comments: 42 pages, 10 figures

    MSC Class: 62P30; ACM Class: G.3.1

  34. arXiv:2107.09145  [pdf, other]

    stat.ML cs.LG

    Adaptive wavelet distillation from neural networks through interpretations

    Authors: Wooseok Ha, Chandan Singh, Francois Lanusse, Srigokul Upadhyayula, Bin Yu

    Abstract: Recent deep-learning models have achieved impressive prediction performance, but often sacrifice interpretability and computational efficiency. Interpretability is crucial in many disciplines, such as science and medicine, where models must be carefully vetted or where interpretation is the goal itself. Moreover, interpretable models are concise and often yield computational efficiency. Here, we p…

    Submitted 26 August, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

  35. arXiv:2106.02096  [pdf, ps, other]

    stat.ML cs.LG

    Shape-Preserving Dimensionality Reduction : An Algorithm and Measures of Topological Equivalence

    Authors: Byeongsu Yu, Kisung You

    Abstract: We introduce a linear dimensionality reduction technique preserving topological features via persistent homology. The method is designed to find linear projection $L$ which preserves the persistent diagram of a point cloud $\mathbb{X}$ via simulated annealing. The projection $L$ induces a set of canonical simplicial maps from the Rips (or Čech) filtration of $\mathbb{X}$ to that of $L\mathbb{X}$.…

    Submitted 13 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    Comments: 18 pages, 2 figures

  36. arXiv:2011.06593  [pdf, other]

    q-bio.QM stat.AP

    A stability-driven protocol for drug response interpretable prediction (staDRIP)

    Authors: Xiao Li, Tiffany M. Tang, Xuewei Wang, Jean-Pierre A. Kocher, Bin Yu

    Abstract: Modern cancer -omics and pharmacological data hold great promise in precision cancer medicine for developing individualized patient treatments. However, high heterogeneity and noise in such data pose challenges for predicting the response of cancer cell lines to therapeutic drugs accurately. As a result, arbitrary human judgment calls are rampant throughout the predictive modeling pipeline. In thi…

    Submitted 16 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: Machine Learning for Health (ML4H) at NeurIPS 2020 - Extended Abstract

  37. arXiv:2008.10109  [pdf, other]

    stat.ME cs.LG stat.AP

    Stable discovery of interpretable subgroups via calibration in causal studies

    Authors: Raaz Dwivedi, Yan Shuo Tan, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu

    Abstract: Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a…

    Submitted 28 September, 2020; v1 submitted 23 August, 2020; originally announced August 2020.

    Comments: Raaz Dwivedi and Yan Shuo Tan are joint first authors and contributed equally to this work. 52 pages, 8 Figures, 9 Tables. To appear in International Statistical Review, 2020

  38. arXiv:2006.10189  [pdf, other]

    cs.LG cs.IT math.ST stat.ML

    Revisiting minimum description length complexity in overparameterized models

    Authors: Raaz Dwivedi, Chandan Singh, Bin Yu, Martin J. Wainwright

    Abstract: Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description le…

    Submitted 12 October, 2023; v1 submitted 17 June, 2020; originally announced June 2020.

    Comments: First two authors contributed equally

  39. arXiv:2006.07841  [pdf, other]

    cs.LG stat.ML

    Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data

    Authors: Bing Yu, Ke Sun, He Wang, Zhouchen Lin, Zhanxing Zhu

    Abstract: The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled (PU) classification and the conditional generation with extra unlabeled data simultaneously. In partic…

    Submitted 8 February, 2024; v1 submitted 14 June, 2020; originally announced June 2020.

  40. Knowledge Distillation: A Survey

    Authors: Jianping Gou, Baosheng Yu, Stephen John Maybank, Dacheng Tao

    Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedd…

    Submitted 20 May, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: It has been accepted for publication in International Journal of Computer Vision (2021)

  41. arXiv:2005.12781  [pdf, other]

    cs.LG cs.IR stat.ML

    How to Grow a (Product) Tree: Personalized Category Suggestions for eCommerce Type-Ahead

    Authors: Jacopo Tagliabue, Bingqing Yu, Marie Beaulieu

    Abstract: In an attempt to balance precision and recall in the search page, leading digital shops have been effectively nudging users into select category facets as early as in the type-ahead suggestions. In this work, we present SessionPath, a novel neural network model that improves facet suggestions on two counts: first, the model is able to leverage session embeddings to provide scalable personalization…

    Submitted 26 May, 2020; originally announced May 2020.

  42. arXiv:2005.11411  [pdf, other]

    cs.LG math.ST stat.ML

    Instability, Computational Efficiency and Statistical Accuracy

    Authors: Nhat Ho, Koulik Khamaru, Raaz Dwivedi, Martin J. Wainwright, Michael I. Jordan, Bin Yu

    Abstract: Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accurac…

    Submitted 20 March, 2022; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: 68 pages, 6 Figures, 2 Tables. First three authors contributed equally

  43. Curating a COVID-19 data repository and forecasting county-level death counts in the United States

    Authors: Nick Altieri, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu

    Abstract: As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de…

    Submitted 9 August, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: Authors ordered alphabetically. All authors contributed significantly to this work. All collected data, modeling code, forecasts, and visualizations are updated daily and available at https://github.com/Yu-Group/covid19-severity-prediction

    Journal ref: Published in Harvard Data Science Review, 2020

  44. arXiv:2003.07160  [pdf, other]

    cs.IR cs.LG stat.ML

    "An Image is Worth a Thousand Features": Scalable Product Representations for In-Session Type-Ahead Personalization

    Authors: Bingqing Yu, Jacopo Tagliabue, Ciro Greco, Federico Bianchi

    Abstract: We address the problem of personalizing query completion in a digital commerce setting, in which the bounce rate is typically high and recurring users are rare. We focus on in-session personalization and improve a standard noisy channel model by injecting dense vectors computed from product images at query time. We argue that image-based personalization displays several advantages over alternative…

    Submitted 11 March, 2020; originally announced March 2020.

    ACM Class: I.2.6; I.2.7

  45. arXiv:2003.01926  [pdf, other]

    stat.ML astro-ph.IM cs.LG

    Transformation Importance with Applications to Cosmology

    Authors: Chandan Singh, Wooseok Ha, Francois Lanusse, Vanessa Boehm, Jia Liu, Bin Yu

    Abstract: Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields require going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency domain) whereas att…

    Submitted 14 June, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

    Comments: Published in ICLR 2020 Workshop on Fundamental Science in the era of AI

  46. arXiv:1912.07254  [pdf, other]

    cs.LG stat.ML

    VLSI Mask Optimization: From Shallow To Deep Learning

    Authors: Haoyu Yang, Wei Zhong, Yuzhe Ma, Hao Geng, Ran Chen, Wanli Chen, Bei Yu

    Abstract: VLSI mask optimization is one of the most critical stages in manufacturability-aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent research has shown prominent advantages of machine learning techniques in dealing with complicated and big data problems, which bring the potential of dedicated machine learning solutions for DFM problems and facilitate…

    Submitted 16 December, 2019; originally announced December 2019.

    Comments: 6 pages; accepted by 25th Asia and South Pacific Design Automation Conference (ASP-DAC 2020)

  47. arXiv:1912.05796  [pdf, other]

    cs.LG cs.AI stat.ML

    Automatic Layout Generation with Applications in Machine Learning Engine Evaluation

    Authors: Haoyu Yang, Wen Chen, Piyush Pathak, Frank Gennari, Ya-Chieh Lai, Bei Yu

    Abstract: Machine learning-based lithography hotspot detection has been deeply studied recently, from various feature extraction techniques to efficient learning models. It has been observed that such machine learning-based frameworks are providing satisfactory metal layer hotspot prediction results on known public metal layer benchmarks. In this work, we seek to evaluate how these machine learning-based hot…

    Submitted 12 December, 2019; originally announced December 2019.

    Comments: 6 pages, submitted to 1st ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) for review

  48. arXiv:1911.09307  [pdf, other]

    cs.LG stat.ML

    Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy

    Authors: Ke Sun, Bing Yu, Zhouchen Lin, Zhanxing Zhu

    Abstract: Regularization plays a crucial role in machine learning models, especially for deep neural networks. The existing regularization techniques mainly rely on the i.i.d. assumption and only consider the knowledge from the current sample, without the leverage of the neighboring relationship between samples. In this work, we propose a general regularizer called \textbf{Patch-level Neighborhood Interpola…

    Submitted 22 October, 2023; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: Accepted in ACML 2023 conference track

  49. arXiv:1911.02549  [pdf, other]

    cs.LG cs.PF stat.ML

    MLPerf Inference Benchmark

    Authors: Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, et al. (22 additional authors not shown)

    Abstract: Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devic…

    Submitted 9 May, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: ISCA 2020

  50. arXiv:1909.13584  [pdf, other]

    cs.LG cs.CV stat.ML

    Interpretations are useful: penalizing explanations to align neural networks with prior knowledge

    Authors: Laura Rieger, Chandan Singh, W. James Murdoch, Bin Yu

    Abstract: For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explana…

    Submitted 8 October, 2020; v1 submitted 30 September, 2019; originally announced September 2019.

    Comments: 18 pages; published in ICML2020; Erratum: numbers in table 1 were too high (now corrected) with the trend remaining the same