Search | arXiv e-print repository

arXiv:2502.19367 [pdf, other]

dCMF: Learning interpretable evolving patterns from temporal multiway data

Authors: Christos Chatzis, Carla Schenker, Jérémy E. Cohen, Evrim Acar

Abstract: Multiway datasets are commonly analyzed using unsupervised matrix and tensor factorization methods to reveal underlying patterns. Frequently, such datasets include timestamps and could correspond to, for example, health-related measurements of subjects collected over time. The temporal dimension is inherently different from the other dimensions, requiring methods that account for its intrinsic properties. Linear Dynamical Systems (LDS) are specifically designed to capture sequential dependencies in the observed data. In this work, we bridge the gap between tensor factorizations and dynamical modeling by exploring the relationship between LDS, Coupled Matrix Factorizations (CMF) and the PARAFAC2 model. We propose a time-aware coupled factorization model called d(ynamical)CMF that constrains the temporal evolution of the latent factors to adhere to a specific LDS structure. Using synthetic datasets, we compare the performance of dCMF with PARAFAC2 and t(emporal)PARAFAC2 which incorporates temporal smoothness. Our results show that dCMF and PARAFAC2-based approaches perform similarly when capturing smoothly evolving patterns that adhere to the PARAFAC2 structure. However, dCMF outperforms alternatives when the patterns evolve smoothly but deviate from the PARAFAC2 structure. Furthermore, we demonstrate that the proposed dCMF method makes it possible to capture more complex dynamics when additional prior information about the temporal evolution is incorporated.

Submitted 26 February, 2025; originally announced February 2025.

arXiv:2501.10202 [pdf, ps, other]

Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

Authors: Nicolas Atienza, Christophe Labreuche, Johanne Cohen, Michele Sebag

Abstract: This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a Generalized Extreme Value (GEV) model of the training distribution in the classifier's latent space, enabling the formal characterization of OOD samples. Interestingly, under mild assumptions, the GEV model also allows for formally characterizing adversarial samples. The abstaining classifier, which rejects samples based on their assessment by the GEV model, provably avoids OOD and adversarial samples. The empirical validation of the approach, conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and medium and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet), demonstrates its frugality, stability, and efficiency compared to the state of the art.

Submitted 17 January, 2025; originally announced January 2025.

Comments: under review

arXiv:2410.24206 [pdf, other]

Understanding Optimization in Deep Learning with Central Flows

Authors: Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason D. Lee

Abstract: Optimization in deep learning remains poorly understood, even in the simple setting of deterministic (i.e. full-batch) training. A key difficulty is that much of an optimizer's behavior is implicitly determined by complex oscillatory dynamics, referred to as the "edge of stability." The main contribution of this paper is to show that an optimizer's implicit behavior can be explicitly captured by a "central flow:" a differential equation which models the time-averaged optimization trajectory. We show that these flows can empirically predict long-term optimization trajectories of generic neural networks with a high degree of numerical accuracy. By interpreting these flows, we reveal for the first time 1) the precise sense in which RMSProp adapts to the local loss landscape, and 2) an "acceleration via regularization" mechanism, wherein adaptive optimizers implicitly navigate towards low-curvature regions in which they can take larger steps. This mechanism is key to the efficacy of these adaptive optimizers. Overall, we believe that central flows constitute a promising tool for reasoning about optimization in deep learning.

Submitted 31 October, 2024; originally announced October 2024.

Comments: first two authors contributed equally; author order determined by coin flip

arXiv:2408.16023 [pdf, other]

Inferring the parameters of Taylor's law in ecology

Authors: Lionel Truquet, Joel E. Cohen, Paul Doukhan

Abstract: Taylor's power law (TL) or fluctuation scaling has been verified empirically for the abundances of many species, human and non-human, and in many other fields including physics, meteorology, computer science, and finance. TL asserts that the variance is directly proportional to a power of the mean, exactly for population moments and, whether or not population moments exist, approximately for sample moments. In many papers, linear regression of log variance as a function of log mean is used to estimate TL's parameters. We provide some statistical guarantees with large-sample asymptotics for this kind of inference under general conditions, and we derive confidence intervals for the parameters. In many ecological applications, the means and variances are estimated over time or across space from arrays of abundance data collected at different locations and time points. When the ratio between the time-series length and the number of spatial points converges to a constant as both become large, the usual normalized statistics are asymptotically biased. We provide a bias correction to get correct confidence intervals. TL, widely studied in multiple sciences, is a source of challenging new statistical problems in a nonstationary spatiotemporal framework. We illustrate our results with both simulated and real data sets.

Submitted 27 August, 2024; originally announced August 2024.
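
The log-log regression mentioned in the abstract above can be sketched in a few lines. This is only a minimal illustration of the standard estimator, assuming Taylor's law in the form variance = a * mean^b; the function name `fit_taylor_law` and the toy data are hypothetical, and the sketch does not include the bias correction or the confidence intervals that the paper derives.

```python
import numpy as np

def fit_taylor_law(means, variances):
    """Ordinary least squares fit of log(variance) = log(a) + b * log(mean).

    Returns the pair (a, b). Hypothetical helper for illustration only.
    """
    log_mean = np.log(np.asarray(means, dtype=float))
    log_var = np.log(np.asarray(variances, dtype=float))
    # np.polyfit with degree 1 returns (slope, intercept)
    slope, intercept = np.polyfit(log_mean, log_var, 1)
    return float(np.exp(intercept)), float(slope)

# Toy data that follows variance = 2 * mean^1.5 exactly (no sampling noise)
example_means = np.array([1.0, 2.0, 4.0, 8.0])
example_variances = 2.0 * example_means ** 1.5
a_hat, b_hat = fit_taylor_law(example_means, example_variances)
```

With sample moments instead of exact ones, the fitted slope and intercept become noisy estimates, which is the setting the abstract's asymptotic guarantees address.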

arXiv:2408.15805 [pdf, other]

Investigating Complex HPV Dynamics Using Emulation and History Matching

Authors: Andrew Iskauskas, Jamie A. Cohen, Danny Scarponi, Ian Vernon, Michael Goldstein, Daniel Klein, Richard G. White, Nicky McCreesh

Abstract: The study of transmission and progression of human papillomavirus (HPV) is crucial for understanding the incidence of cervical cancers, and has been identified as a priority worldwide. The complexity of the disease necessitates a detailed model of HPV transmission and its progression to cancer; to infer properties of the above we require a careful process that can match to imperfect or incomplete observational data. In this paper, we describe the HPVsim simulator to satisfy the former requirement; to satisfy the latter we couple this stochastic simulator to a process of emulation and history matching using the R package hmer. With these tools, we are able to obtain a comprehensive collection of parameter combinations that could give rise to observed cancer data, and explore the implications of the variability of these parameter sets as it relates to future health interventions.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 21 pages, 15 figures; submitted to Epidemics

arXiv:2304.00195 [pdf, other]

Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers

Authors: Awni Altabaa, Taylor Webb, Jonathan Cohen, John Lafferty

Abstract: An extension of Transformers is proposed that enables explicit relational reasoning through a novel module called the Abstractor. At the core of the Abstractor is a variant of attention called relational cross-attention. The approach is motivated by an architectural inductive bias for relational learning that disentangles relational information from object-level features. This enables explicit relational reasoning, supporting abstraction and generalization from limited data. The Abstractor is first evaluated on simple discriminative relational tasks and compared to existing relational architectures. Next, the Abstractor is evaluated on purely relational sequence-to-sequence tasks, where dramatic improvements are seen in sample efficiency compared to standard Transformers. Finally, Abstractors are evaluated on a collection of tasks based on mathematical problem solving, where consistent improvements in performance and sample efficiency are observed.

Submitted 12 April, 2024; v1 submitted 31 March, 2023; originally announced April 2023.

Comments: Published at ICLR 2024

arXiv:2209.10666 [pdf, other]

doi 10.1038/s41467-023-38874-y

Adaptive Bias Correction for Improved Subseasonal Forecasting

Authors: Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Judah Cohen, Miruna Oprescu, Ernest Fraenkel, Lester Mackey

Abstract: Subseasonal forecasting -- predicting temperature and precipitation 2 to 6 weeks ahead -- is critical for effective water allocation, wildfire management, and drought and flood mitigation. Recent international research efforts have advanced the subseasonal capabilities of operational dynamical models, yet temperature and precipitation prediction skills remain poor, partly due to stubborn errors in representing atmospheric dynamics and physics inside dynamical models. Here, to counter these errors, we introduce an adaptive bias correction (ABC) method that combines state-of-the-art dynamical forecasts with observations using machine learning. We show that, when applied to the leading subseasonal model from the European Centre for Medium-Range Weather Forecasts (ECMWF), ABC improves temperature forecasting skill by 60-90% (over baseline skills of 0.18-0.25) and precipitation forecasting skill by 40-69% (over baseline skills of 0.11-0.15) in the contiguous U.S. We couple these performance improvements with a practical workflow to explain ABC skill gains and identify higher-skill windows of opportunity based on specific climate conditions.

Submitted 15 May, 2023; v1 submitted 21 September, 2022; originally announced September 2022.

arXiv:2206.10654 [pdf, other]

On the Maximum Hessian Eigenvalue and Generalization

Authors: Simran Kaur, Jeremy Cohen, Zachary C. Lipton

Abstract: The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $λ_{max}$ and generalization. In this paper, we present findings that call $λ_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $λ_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $λ_{max}$ without affecting generalization; (3) while SAM produces smaller $λ_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $λ_{max}$; and (5) while batch-normalization does not consistently produce smaller $λ_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $λ_{max}$'s ability to explain generalization in neural networks.

Submitted 23 May, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, PMLR 187:51-65, 2023

arXiv:2202.01650 [pdf, other]

Exposure Effects on Count Outcomes with Observational Data, with Application to Incarcerated Women

Authors: Bonnie E. Shook-Sa, Michael G. Hudgens, Andrea K. Knittel, Andrew Edmonds, Catalina Ramirez, Stephen R. Cole, Mardge Cohen, Adebola Adedimeji, Tonya Taylor, Katherine G. Michel, Andrea Kovacs, Jennifer Cohen, Jessica Donohue, Antonina Foster, Margaret A. Fischl, Dustin Long, Adaora A. Adimora

Abstract: Causal inference methods can be applied to estimate the effect of a point exposure or treatment on an outcome of interest using data from observational studies. For example, in the Women's Interagency HIV Study, it is of interest to understand the effects of incarceration on the number of sexual partners and the number of cigarettes smoked after incarceration. In settings like this where the outcome is a count, the estimand is often the causal mean ratio, i.e., the ratio of the counterfactual mean count under exposure to the counterfactual mean count under no exposure. This paper considers estimators of the causal mean ratio based on inverse probability of treatment weights, the parametric g-formula, and doubly robust estimation, each of which can account for overdispersion, zero-inflation, and heaping in the measured outcome. Methods are compared in simulations and are applied to data from the Women's Interagency HIV Study.

Submitted 6 November, 2023; v1 submitted 3 February, 2022; originally announced February 2022.
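
The causal mean ratio defined in the abstract above can be illustrated with a basic inverse-probability-weighted estimate. This is a rough sketch under strong assumptions: the propensity scores are taken as known rather than estimated, the weighted means use the normalized (Hajek-style) form, and nothing here handles overdispersion, zero-inflation, or heaping as the paper's estimators do. The function name `iptw_mean_ratio` and the toy data are hypothetical.

```python
import numpy as np

def iptw_mean_ratio(exposure, outcome, propensity):
    """Ratio of weighted mean counts: exposed over unexposed.

    exposure:   0/1 indicator per subject
    outcome:    observed count per subject
    propensity: assumed-known probability of exposure per subject
    """
    a = np.asarray(exposure, dtype=float)
    y = np.asarray(outcome, dtype=float)
    e = np.asarray(propensity, dtype=float)
    w_exposed = a / e                    # weights for exposed subjects
    w_unexposed = (1.0 - a) / (1.0 - e)  # weights for unexposed subjects
    mean_exposed = np.sum(w_exposed * y) / np.sum(w_exposed)
    mean_unexposed = np.sum(w_unexposed * y) / np.sum(w_unexposed)
    return float(mean_exposed / mean_unexposed)

# Toy data: constant propensity 0.5, so the estimate reduces to the
# ratio of the two group means (5.0 / 2.5).
toy_ratio = iptw_mean_ratio([1, 1, 0, 0], [4, 6, 2, 3], [0.5, 0.5, 0.5, 0.5])
```

In an observational study the propensity scores would typically be estimated from covariates, which is where confounding adjustment enters.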

arXiv:2111.12399 [pdf, other]

Dictionary-based Low-Rank Approximations and the Mixed Sparse Coding problem

Authors: Jeremy E. Cohen

Abstract: Constrained tensor and matrix factorization models make it possible to extract interpretable patterns from multiway data. Therefore identifiability properties and efficient algorithms for constrained low-rank approximations are nowadays important research topics. This work deals with columns of factor matrices of a low-rank approximation being sparse in a known and possibly overcomplete basis, a model coined as Dictionary-based Low-Rank Approximation (DLRA). While earlier contributions focused on finding factor columns inside a dictionary of candidate columns, i.e. one-sparse approximations, this work is the first to tackle DLRA with sparsity larger than one. I propose to focus on the sparse-coding subproblem coined Mixed Sparse-Coding (MSC) that emerges when solving DLRA with an alternating optimization strategy. Several algorithms based on sparse-coding heuristics (greedy methods, convex relaxations) are provided to solve MSC. The performance of these heuristics is evaluated on simulated data. Then, I show how to adapt an efficient MSC solver based on the LASSO to compute Dictionary-based Matrix Factorization and Canonical Polyadic Decomposition in the context of hyperspectral image processing and chemometrics. These experiments suggest that DLRA extends the modeling capabilities of low-rank approximations, helps reduce estimation variance and enhances the identifiability and interpretability of estimated factors.

Submitted 21 January, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

arXiv:2110.01278 [pdf, other]

doi 10.1137/21M1450033

An AO-ADMM approach to constraining PARAFAC2 on all modes

Authors: Marie Roald, Carla Schenker, Vince D. Calhoun, Tülay Adalı, Rasmus Bro, Jeremy E. Cohen, Evrim Acar

Abstract: Analyzing multi-way measurements with variations across one mode of the dataset is a challenge in various fields including data mining, neuroscience and chemometrics. For example, measurements may evolve over time or have unaligned time profiles. The PARAFAC2 model has been successfully used to analyze such data by allowing the underlying factor matrices in one mode (i.e., the evolving mode) to change across slices. The traditional approach to fit a PARAFAC2 model is to use an alternating least squares-based algorithm, which handles the constant cross-product constraint of the PARAFAC2 model by implicitly estimating the evolving factor matrices. This approach makes imposing regularization on these factor matrices challenging. There is currently no algorithm to flexibly impose such regularization with general penalty functions and hard constraints. In order to address this challenge and to avoid the implicit estimation, in this paper, we propose an algorithm for fitting PARAFAC2 based on alternating optimization with the alternating direction method of multipliers (AO-ADMM). With numerical experiments on simulated data, we show that the proposed PARAFAC2 AO-ADMM approach allows for flexible constraints, recovers the underlying patterns accurately, and is computationally efficient compared to the state-of-the-art. We also apply our model to two real-world datasets from neuroscience and chemometrics, and show that constraining the evolving mode improves the interpretability of the extracted patterns.

Submitted 8 July, 2022; v1 submitted 4 October, 2021; originally announced October 2021.

MSC Class: 15A69; 90C26

Journal ref: SIAM J. Math. Data Sci. 4 (2022) 1191-1222

arXiv:2109.10399 [pdf, other]

SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking

Authors: Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, Lester Mackey

Abstract: Subseasonal forecasting of the weather two to six weeks in advance is critical for resource allocation and advance disaster notice but poses many challenges for the forecasting community. At this forecast horizon, physics-based dynamical models have limited skill, and the targets for prediction depend in a complex manner on both local weather variables and global climate variables. Recently, machine learning methods have shown promise in advancing the state of the art but only at the cost of complex data curation, integrating expert knowledge with aggregation across multiple relevant data sources, file formats, and temporal and spatial resolutions. To streamline this process and accelerate future development, we introduce SubseasonalClimateUSA, a curated dataset for training and benchmarking subseasonal forecasting models in the United States. We use this dataset to benchmark a diverse suite of models, including operational dynamical models, classical meteorological baselines, and ten state-of-the-art machine learning and deep learning-based methods from the literature. Overall, our benchmarks suggest simple and effective ways to extend the accuracy of current operational models. SubseasonalClimateUSA is regularly updated and accessible via the https://github.com/microsoft/subseasonal_data/ Python package.

Submitted 16 January, 2024; v1 submitted 21 September, 2021; originally announced September 2021.

arXiv:2106.06885 [pdf, other]

Online Learning with Optimism and Delay

Authors: Genevieve Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, Lester Mackey

Abstract: Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms -- DORM, DORM+, and AdaHedgeD -- arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.

Submitted 12 July, 2021; v1 submitted 12 June, 2021; originally announced June 2021.

Comments: ICML 2021. 9 pages of main paper and 26 pages of appendix text

arXiv:2105.00773 [pdf, other]

Approximate Bayesian Computation for an Explicit-Duration Hidden Markov Model of COVID-19 Hospital Trajectories

Authors: Gian Marco Visani, Alexandra Hope Lee, Cuong Nguyen, David M. Kent, John B. Wong, Joshua T. Cohen, Michael C. Hughes

Abstract: We address the problem of modeling constrained hospital resources in the midst of the COVID-19 pandemic in order to inform decision-makers of future demand and assess the societal value of possible interventions. For broad applicability, we focus on the common yet challenging scenario where patient-level data for a region of interest are not available. Instead, given daily admissions counts, we model aggregated counts of observed resource use, such as the number of patients in the general ward, in the intensive care unit, or on a ventilator. In order to explain how individual patient trajectories produce these counts, we propose an aggregate count explicit-duration hidden Markov model, nicknamed the ACED-HMM, with an interpretable, compact parameterization. We develop an Approximate Bayesian Computation approach that draws samples from the posterior distribution over the model's transition and duration parameters given aggregate counts from a specific location, thus adapting the model to a region or individual hospital site of interest. Samples from this posterior can then be used to produce future forecasts of any counts of interest. Using data from the United States and the United Kingdom, we show our mechanistic approach provides competitive probabilistic forecasts for the future even as the dynamics of the pandemic shift. Furthermore, we show how our model provides insight about recovery probabilities or length of stay distributions, and we suggest its potential to answer challenging what-if questions about the societal value of possible interventions.

Submitted 28 July, 2021; v1 submitted 28 April, 2021; originally announced May 2021.

Comments: To appear in the Proceedings of the Machine Learning for Healthcare (MLHC) conference, 2021. 20 pages, 7 figures and 1 table. 26 additional pages of supplementary material

arXiv:2104.09327 [pdf, other]

Forecasting COVID-19 Counts At A Single Hospital: A Hierarchical Bayesian Approach

Authors: Alexandra Hope Lee, Panagiotis Lymperopoulos, Joshua T. Cohen, John B. Wong, Michael C. Hughes

Abstract: We consider the problem of forecasting the daily number of hospitalized COVID-19 patients at a single hospital site, in order to help administrators with logistics and planning. We develop several candidate hierarchical Bayesian models which directly capture the count nature of data via a generalized Poisson likelihood, model time-series dependencies via autoregressive and Gaussian process latent processes, and share statistical strength across related sites. We demonstrate our approach on public datasets for 8 hospitals in Massachusetts, U.S.A. and 10 hospitals in the United Kingdom. Further prospective evaluation compares our approach favorably to baselines currently used by stakeholders at 3 related hospitals to forecast 2-week-ahead demand by rescaling state-level forecasts.

Submitted 14 April, 2021; originally announced April 2021.

Comments: In ICLR 2021 Workshop on Machine Learning for Preventing and Combating Pandemics

arXiv:2103.00065 [pdf, other]

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

Authors: Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar

Abstract: We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.

Submitted 23 November, 2022; v1 submitted 26 February, 2021; originally announced March 2021.

Comments: ICLR 2021. v3 moves several figures from the appendix into the main text, and adds more discussion regarding Jastrzębski et al (2020): https://doi.org/10.48550/arXiv.2002.09572
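
The value $2 / \text{(step size)}$ in the abstract above is the classical stability threshold for gradient descent on a quadratic, which a small sketch can make concrete. The example assumes the one-dimensional objective f(x) = 0.5 * sharpness * x^2, whose Hessian is the constant `sharpness`; it illustrates only the threshold itself, not the paper's neural network experiments, and the function name `gd_on_quadratic` is hypothetical.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each update multiplies x by (1 - step_size * sharpness), so the
    iterates shrink only when sharpness < 2 / step_size.
    """
    x = x0
    for _ in range(steps):
        gradient = sharpness * x
        x -= step_size * gradient
    return x

step_size = 0.1
threshold = 2.0 / step_size  # stability threshold for this step size

x_below = gd_on_quadratic(0.9 * threshold, step_size)  # multiplier -0.8: decays
x_above = gd_on_quadratic(1.1 * threshold, step_size)  # multiplier -1.2: grows
```

On a quadratic, crossing the threshold means divergence. The abstract's observation is that neural network training losses nevertheless keep decreasing over long timescales while the sharpness hovers just above this value.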

arXiv:2102.02087 [pdf, other]

doi 10.23919/EUSIPCO54536.2021.9615927

PARAFAC2 AO-ADMM: Constraints in all modes

Authors: Marie Roald, Carla Schenker, Jeremy E. Cohen, Evrim Acar

Abstract: The PARAFAC2 model provides a flexible alternative to the popular CANDECOMP/PARAFAC (CP) model for tensor decompositions. Unlike CP, PARAFAC2 allows factor matrices in one mode (i.e., evolving mode) to change across tensor slices, which has proven useful for applications in different domains such as chemometrics, and neuroscience. However, the evolving mode of the PARAFAC2 model is traditionally modelled implicitly, which makes it challenging to regularise it. Currently, the only way to apply regularisation on that mode is with a flexible coupling approach, which finds the solution through regularised least-squares subproblems. In this work, we instead propose an alternating direction method of multipliers (ADMM)-based algorithm for fitting PARAFAC2 and widen the possible regularisation penalties to any proximable function. Our numerical experiments demonstrate that the proposed ADMM-based approach for PARAFAC2 can accurately recover the underlying components from simulated data while being both computationally efficient and flexible in terms of imposing constraints.

Submitted 3 February, 2021; originally announced February 2021.

Comments: 5 pages, 4 figures, submitted to EUSIPCO21

arXiv:2011.11066 [pdf, other]

doi 10.1007/s10994-022-06260-2

Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares

Authors: Nicolas Nadisic, Jeremy E Cohen, Arnaud Vandaele, Nicolas Gillis

Abstract: Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reduce noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS into subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.

Submitted 22 June, 2022; v1 submitted 22 November, 2020; originally announced November 2020.

Comments: 25 pages + 18 pages supplementary material. This is the new version of a work originally called "A Homotopy-based Algorithm for Sparse Multiple Right-hand Sides Nonnegative Least Squares". Although the central concept is the same, the paper has been almost completely rewritten

Journal ref: Machine Learning 111, pp. 4453-4495, 2022

arXiv:2007.10527 [pdf, ps, other]

Navigating the Trade-Off between Multi-Task Learning and Learning to Multitask in Deep Neural Networks

Authors: Sachin Ravi, Sebastian Musslick, Maia Hamin, Theodore L. Willke, Jonathan D. Cohen

Abstract: The terms multi-task learning and multitasking are easily confused. Multi-task learning refers to a paradigm in machine learning in which a network is trained on various related tasks to facilitate the acquisition of tasks. In contrast, multitasking is used to indicate, especially in the cognitive science literature, the ability to execute multiple tasks simultaneously. While multi-task learning exploits the discovery of common structure between tasks in the form of shared representations, multitasking is promoted by separating representations between tasks to avoid processing interference. Here, we build on previous work involving shallow networks and simple task settings suggesting that there is a trade-off between multi-task learning and multitasking, mediated by the use of shared versus separated representations. We show that the same tension arises in deep networks and discuss a meta-learning algorithm for an agent to manage this trade-off in an unfamiliar environment. We display through different experiments that the agent is able to successfully optimize its training strategy as a function of the environment.

Submitted 5 January, 2021; v1 submitted 20 July, 2020; originally announced July 2020.

arXiv:2007.09605 [pdf, other]

doi 10.1109/JSTSP.2020.3045848

A Flexible Optimization Framework for Regularized Matrix-Tensor Factorizations with Linear Couplings

Authors: Carla Schenker, Jeremy E. Cohen, Evrim Acar

Abstract: Coupled matrix and tensor factorizations (CMTF) are frequently used to jointly analyze data from multiple sources, also called data fusion. However, different characteristics of datasets stemming from multiple sources pose many challenges in data fusion and require employing various regularizations, constraints, loss functions and different types of coupling structures between datasets. In this paper, we propose a flexible algorithmic framework for coupled matrix and tensor factorizations which utilizes Alternating Optimization (AO) and the Alternating Direction Method of Multipliers (ADMM). The framework facilitates the use of a variety of constraints, loss functions and couplings with linear transformations in a seamless way. Numerical experiments on simulated and real datasets demonstrate that the proposed approach is accurate, and computationally efficient with comparable or better performance than available CMTF methods for Frobenius norm loss, while being more flexible. Using Kullback-Leibler divergence on count data, we demonstrate that the algorithm yields accurate results also for other loss functions.

Submitted 19 July, 2020; originally announced July 2020.

arXiv:2007.04250 [pdf, other]

A Benchmark of Medical Out of Distribution Detection

Authors: Tianshi Cao, Chin-Wei Huang, David Yu-Tung Hui, Joseph Paul Cohen

Abstract: Motivation: Deep learning models deployed for use on medical tasks can be equipped with Out-of-Distribution Detection (OoDD) methods in order to avoid erroneous predictions. However it is unclear which OoDD method should be used in practice. Specific Problem: Systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be flagged by an OoDD method prior to diagnosis. Our approach: This paper defines 3 categories of OoD examples and benchmarks popular OoDD methods in three domains of medical imaging: chest X-ray, fundus imaging, and histology slides. Results: Our experiments show that despite methods yielding good results on some categories of out-of-distribution samples, they fail to recognize images close to the training distribution. Conclusion: We find a simple binary classifier on the feature representation has the best accuracy and AUPRC on average. Users of diagnostic tools which employ these OoDD methods should still remain vigilant that images very close to the training distribution yet not in it could yield unexpected results.

Submitted 4 August, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

Comments: Submitted to Machine Learning for Biomedical Imaging Journal (MELBA)

arXiv:2006.07553 [pdf, other]

Sparse Separable Nonnegative Matrix Factorization

Authors: Nicolas Nadisic, Arnaud Vandaele, Jeremy E. Cohen, Nicolas Gillis

Abstract: We propose a new variant of nonnegative matrix factorization (NMF), combining separability and sparsity assumptions. Separability requires that the columns of the first NMF factor are equal to columns of the input matrix, while sparsity requires that the columns of the second NMF factor are sparse. We call this variant sparse separable NMF (SSNMF), which we prove to be NP-complete, as opposed to separable NMF which can be solved in polynomial time. The main motivation to consider this new model is to handle underdetermined blind source separation problems, such as multispectral image unmixing. We introduce an algorithm to solve SSNMF, based on the successive nonnegative projection algorithm (SNPA, an effective algorithm for separable NMF), and an exact sparse nonnegative least squares solver. We prove that, in noiseless settings and under mild assumptions, our algorithm recovers the true underlying sources. This is illustrated by experiments on synthetic data sets and the unmixing of a multispectral image.

Submitted 12 June, 2020; originally announced June 2020.

Comments: 20 pages, accepted in ECML 2020

arXiv:2005.11856 [pdf, other]

Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning

Authors: Joseph Paul Cohen, Lan Dao, Paul Morrison, Karsten Roth, Yoshua Bengio, Beiyi Shen, Almas Abbasi, Mahsa Hoshmand-Kochi, Marzyeh Ghassemi, Haifang Li, Tim Q Duong

Abstract: Purpose: The need to streamline patient management for COVID-19 has become more pressing than ever. Chest X-rays provide a non-invasive (potentially bedside) tool to monitor the progression of the disease. In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images. Such a tool can gauge severity of COVID-19 lung infections (and pneumonia in general) that can be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU. Methods: Images from a public COVID-19 database were scored retrospectively by three blinded experts in terms of the extent of lung involvement as well as the degree of opacity. A neural network model that was pre-trained on large (non-COVID-19) chest X-ray datasets is used to construct features for COVID-19 images which are predictive for our task. Results: This study finds that training a regression model on a subset of the outputs from this pre-trained chest X-ray model predicts our geographic extent score (range 0-8) with 1.14 mean absolute error (MAE) and our lung opacity score (range 0-6) with 0.78 MAE. Conclusions: These results indicate that our model's ability to gauge severity of COVID-19 lung infections could be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the intensive care unit (ICU). A proper clinical trial is needed to evaluate efficacy.
To enable this we make our code, labels, and data available online at https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-severity and https://github.com/ieee8023/covid-chestxray-dataset

Submitted 30 June, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

arXiv:2002.02582 [pdf, other]

Quantifying the Value of Lateral Views in Deep Learning for Chest X-rays

Authors: Mohammad Hashir, Hadrien Bertrand, Joseph Paul Cohen

Abstract: Most deep learning models in chest X-ray prediction utilize the posteroanterior (PA) view due to the lack of other views available. PadChest is a large-scale chest X-ray dataset that has almost 200 labels and multiple views available. In this work, we use PadChest to explore multiple approaches to merging the PA and lateral views for predicting the radiological labels associated with the X-ray image. We find that different methods of merging the model utilize the lateral view differently. We also find that including the lateral view increases performance for 32 labels in the dataset, while being neutral for the others. The increase in overall performance is comparable to the one obtained by using only the PA view with twice the amount of patients in the training set.

Submitted 6 February, 2020; originally announced February 2020.

Comments: Under review at MIDL 2020

arXiv:2002.02497 [pdf, other]

On the limits of cross-domain generalization in automated X-ray prediction

Authors: Joseph Paul Cohen, Mohammad Hashir, Rupert Brooks, Hadrien Bertrand

Abstract: This large-scale study focuses on quantifying which X-ray diagnostic prediction tasks generalize well across multiple different datasets. We present evidence that the issue of generalization is not due to a shift in the images but instead a shift in the labels. We study the cross-domain performance, agreement between models, and model representations. We find interesting discrepancies between performance and agreement where models which both achieve good performance disagree in their predictions as well as models which agree yet achieve poor performance. We also test for concept similarity by regularizing a network to group tasks across multiple datasets together and observe variation across the tasks. All code is made available online and data is publicly available: https://github.com/mlmed/torchxrayvision

Submitted 24 May, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

Comments: Full paper at MIDL2020

arXiv:2001.04321 [pdf, other]

doi 10.1002/nla.2373

Accelerating Block Coordinate Descent for Nonnegative Tensor Factorization

Authors: Andersen Man Shun Ang, Jeremy E. Cohen, Nicolas Gillis, Le Thi Khanh Hien

Abstract: This paper is concerned with improving the empirical convergence speed of block-coordinate descent algorithms for approximate nonnegative tensor factorization (NTF). We propose an extrapolation strategy in-between block updates, referred to as heuristic extrapolation with restarts (HER). HER significantly accelerates the empirical convergence speed of most existing block-coordinate algorithms for dense NTF, in particular for challenging computational scenarios, while requiring a negligible additional computational budget.

Submitted 20 November, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

Comments: 32 pages, 24 figures

Journal ref: Numerical Linear Algebra with Applications, e2373, 2021

arXiv:1910.09570 [pdf, other]

Icentia11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery

Authors: Shawn Tan, Guillaume Androz, Ahmad Chamseddine, Pierre Fecteau, Aaron Courville, Yoshua Bengio, Joseph Paul Cohen

Abstract: We release the largest public ECG dataset of continuous raw signals for representation learning containing 11 thousand patients and 2 billion labelled beats. Our goal is to enable semi-supervised ECG models to be made as well as to discover unknown subtypes of arrhythmia and anomalous ECG signal events. To this end, we propose an unsupervised representation learning task, evaluated in a semi-supervised fashion. We provide a set of baselines for different feature extractors that can be built upon. Additionally, we perform qualitative evaluations on results from PCA embeddings, where we identify some clustering of known subtypes indicating the potential for representation learning in arrhythmia sub-type discovery.

Submitted 21 October, 2019; originally announced October 2019.

Comments: Under Review

arXiv:1910.08640 [pdf, other]

Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?

Authors: Simran Kaur, Jeremy Cohen, Zachary C. Lipton

Abstract: For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding supports the hypothesis that perceptually-aligned gradients may be a general property of robust classifiers. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.

Submitted 23 October, 2019; v1 submitted 18 October, 2019; originally announced October 2019.

Comments: To appear in the "Science Meets Engineering of Deep Learning" Workshop at NeurIPS 2019

arXiv:1910.08636 [pdf, other]

The TCGA Meta-Dataset Clinical Benchmark

Authors: Mandana Samiei, Tobias Würfl, Tristan Deleu, Martin Weiss, Francis Dutil, Thomas Fevens, Geneviève Boucher, Sebastien Lemieux, Joseph Paul Cohen

Abstract: Machine learning is bringing a paradigm shift to healthcare by changing the process of disease diagnosis and prognosis in clinics and hospitals. This development equips doctors and medical staff with tools to evaluate their hypotheses and hence make more precise decisions. Although most current research in the literature seeks to develop techniques and methods for predicting one particular clinical outcome, this approach is far from the reality of clinical decision making in which you have to consider several factors simultaneously. In addition, it is difficult to follow the recent progress concretely as there is a lack of consistency in benchmark datasets and task definitions in the field of Genomics. To address the aforementioned issues, we provide a clinical Meta-Dataset derived from the publicly available data hub called The Cancer Genome Atlas Program (TCGA) that contains 174 tasks. We believe those tasks could be good proxy tasks to develop methods which can work on a few samples of gene expression data. Also, learning to predict multiple clinical variables using gene-expression data is an important task due to the variety of phenotypes in clinical problems and lack of samples for some of the rare variables. The defined tasks cover a wide range of clinical problems including predicting tumor tissue site, white cell count, histological type, family history of cancer, gender, and many others which we explain later in the paper. Each task represents an independent dataset.
We use regression and neural network baselines for all the tasks using only 150 samples and compare their performance.

Submitted 18 October, 2019; originally announced October 2019.

Comments: 5 Pages, Submitted to MLCB 2019

arXiv:1909.06576 [pdf, ps, other]

Torchmeta: A Meta-Learning library for PyTorch

Authors: Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, Yoshua Bengio

Abstract: The constant introduction of standardized benchmarks in the literature has helped accelerate the recent advances in meta-learning research. They offer a way to get a fair comparison between different algorithms, and the wide range of datasets available allows full control over the complexity of this evaluation. However, for a large majority of code available online, the data pipeline is often specific to one dataset, and testing on another dataset requires significant rework. We introduce Torchmeta, a library built on top of PyTorch that enables seamless and consistent evaluation of meta-learning algorithms on multiple datasets, by providing data-loaders for most of the standard benchmarks in few-shot classification and regression, with a new meta-dataset abstraction. It also features some extensions for PyTorch to simplify the development of models compatible with meta-learning algorithms. The code is available here: https://github.com/tristandeleu/pytorch-meta

Submitted 14 September, 2019; originally announced September 2019.

arXiv:1905.11286 [pdf, other]

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

Authors: Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

Abstract: We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam.

Submitted 6 February, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Preprint, under review

arXiv:1904.04861 [pdf, other]

Universal Lipschitz Approximation in Bounded Depth Neural Networks

Authors: Jeremy E. J. Cohen, Todd Huster, Ra Cohen

Abstract: Adversarial attacks against machine learning models are a rather hefty obstacle to our increasing reliance on these models. Due to this, provably robust (certified) machine learning models are a major topic of interest. Lipschitz continuous models present a promising approach to solving this problem. By leveraging the expressive power of a variant of neural networks which maintain low Lipschitz constants, we prove that three layer neural networks using the FullSort activation function are Universal Lipschitz function Approximators (ULAs). This both explains experimental results and paves the way for the creation of better certified models going forward. We conclude by presenting experimental results that suggest that ULAs are not just a novelty, but a competitive approach to providing certified classifiers, using these results to motivate several potential topics of further research.

Submitted 9 April, 2019; originally announced April 2019.

arXiv:1902.02918 [pdf, other]

Certified Adversarial Robustness via Randomized Smoothing

Authors: Jeremy M Cohen, Elan Rosenfeld, J. Zico Kolter

Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at http://github.com/locuslab/smoothing.

Submitted 15 June, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: ICML 2019
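
For readers unfamiliar with the guarantee mentioned in the abstract above: to the best of my understanding, the paper's $\ell_2$ certificate for Gaussian smoothing with noise level σ takes the form R = (σ/2)(Φ⁻¹(p_A) − Φ⁻¹(p_B)), where p_A is a lower bound on the probability of the top class under noise and p_B an upper bound on the runner-up class. The sketch below only evaluates that formula. The function name `certified_l2_radius` and its parameters are hypothetical, and the code is not taken from the authors' repository; estimating p_A and p_B by sampling is left out.

```python
from statistics import NormalDist


def certified_l2_radius(sigma, p_a_lower, p_b_upper):
    """Evaluate R = (sigma / 2) * (Phi^-1(p_a_lower) - Phi^-1(p_b_upper)).

    sigma      -- standard deviation of the Gaussian smoothing noise
    p_a_lower  -- lower bound on the top-class probability under noise
    p_b_upper  -- upper bound on the runner-up class probability under noise
    Both probabilities must lie strictly between 0 and 1.
    Returns 0.0 when the bounds do not separate the two classes.
    """
    if p_a_lower <= p_b_upper:
        return 0.0  # no certificate can be issued in this case
    inverse_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return 0.5 * sigma * (inverse_cdf(p_a_lower) - inverse_cdf(p_b_upper))


radius = certified_l2_radius(sigma=0.5, p_a_lower=0.9, p_b_upper=0.1)
print(f"certified l2 radius: {radius:.4f}")
```

The radius grows with both the noise level and the margin between the two class probabilities, which is why the base classifier has to classify well under Gaussian noise for the certificate to be useful.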

arXiv:1810.03442 [pdf, other]

Towards the Latent Transcriptome

Authors: Assya Trofimov, Francis Dutil, Claude Perreault, Sebastien Lemieux, Yoshua Bengio, Joseph Paul Cohen

Abstract: In this work we propose a method to compute continuous embeddings for kmers from raw RNA-seq data, without the need for alignment to a reference genome. The approach uses an RNN to transform kmers of the RNA-seq reads into a 2 dimensional representation that is used to predict abundance of each kmer. We report that our model captures information of both DNA sequence similarity as well as DNA sequence abundance in the embedding latent space, that we call the Latent Transcriptome. We confirm the quality of these vectors by comparing them to known gene sub-structures and report that the latent space recovers exon information from raw RNA-Seq data from acute myeloid leukemia patients. Furthermore we show that this latent space allows the detection of genomic abnormalities such as translocations as well as patient-specific mutations, making this representation space both useful for visualization as well as analysis.

Submitted 10 December, 2018; v1 submitted 8 October, 2018; originally announced October 2018.

Comments: 7 figures

arXiv:1810.00045 [pdf, other]

Adversarial Domain Adaptation for Stable Brain-Machine Interfaces

Authors: Ali Farshchian, Juan A. Gallego, Joseph P. Cohen, Yoshua Bengio, Lee E. Miller, Sara A. Solla

Abstract: Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface.
However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem.

Submitted 15 January, 2019; v1 submitted 28 September, 2018; originally announced October 2018.

Comments: 14 pages, 6 figures

arXiv:1809.07394 [pdf, other]

Improving Subseasonal Forecasting in the Western U.S. with Machine Learning

Authors: Jessica Hwang, Paulo Orenstein, Judah Cohen, Karl Pfeiffer, Lester Mackey

Abstract: Water managers in the western United States (U.S.) rely on longterm forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these longterm forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance. Here we present and evaluate our machine learning approach to the Rodeo and release our SubseasonalRodeo dataset, collected to train and evaluate our forecasting system. Our system is an ensemble of two regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon. Moreover, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40-50% for temperature and 129-169% for precipitation. We hope that both our dataset and our methods will help to advance the state of the art in subseasonal forecasting.

Submitted 22 May, 2019; v1 submitted 19 September, 2018; originally announced September 2018.

arXiv:1808.08765 [pdf, ps, other]

doi 10.1137/18M1233339

Identifiability of Complete Dictionary Learning

Authors: Jérémy E. Cohen, Nicolas Gillis

Abstract: Sparse component analysis (SCA), also known as complete dictionary learning, is the following problem: Given an input matrix $M$ and an integer $r$, find a dictionary $D$ with $r$ columns and a matrix $B$ with $k$-sparse columns (that is, each column of $B$ has at most $k$ non-zero entries) such that $M \approx DB$. A key issue in SCA is identifiability, that is, characterizing the conditions under which $D$ and $B$ are essentially unique (that is, they are unique up to permutation and scaling of the columns of $D$ and rows of $B$). Although SCA has been vastly investigated in the last two decades, only a few works have tackled this issue in the deterministic scenario, and no work provides reasonable bounds in the minimum number of samples (that is, columns of $M$) that leads to identifiability. In this work, we provide new results in the deterministic scenario when the data has a low-rank structure, that is, when $D$ is (under)complete. While previous bounds feature a combinatorial term $r \choose k$, we exhibit a sufficient condition involving $\mathcal{O}(r^3/(r-k)^2)$ samples that yields an essentially unique decomposition, as long as these data points are well spread among the subspaces spanned by $r-1$ columns of $D$. We also exhibit a necessary lower bound on the number of samples that contradicts previous results in the literature when $k$ equals $r-1$. Our bounds provide a drastic improvement compared to the state of the art, and imply for example that for a fixed proportion of zeros (constant and independent of $r$, e.g., 10\% of zero entries in $B$), one only requires $\mathcal{O}(r)$ data points to guarantee identifiability.

Submitted 28 March, 2019; v1 submitted 27 August, 2018; originally announced August 2018.

Comments: 19 pages, 2 figures, new title, added references and discussions

Journal ref: SIAM Journal on Mathematics of Data Science 1 (3), pp. 518-536, 2019

arXiv:1806.06975 [pdf, other]

Towards Gene Expression Convolutions using Gene Interaction Graphs

Authors: Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko, Yoshua Bengio

Abstract: We study the challenges of applying deep learning to gene expression data. We find experimentally that there exists non-linear signal in the data; however, it is not discovered automatically given the noise and low numbers of samples used in most research. We discuss how gene interaction graphs (same pathway, protein-protein, co-expression, or research paper text association) can be used to impose a bias on a deep model similar to the spatial bias imposed by convolutions on an image. We explore the usage of Graph Convolutional Neural Networks coupled with dropout and gene embeddings to utilize the graph information. We find this approach provides an advantage for particular tasks in a low data regime but is very dependent on the quality of the graph used. We conclude that more work should be done in this direction. We design experiments that show why existing methods fail to capture signal that is present in the data when features are added which clearly isolates the problem that needs to be addressed.

Submitted 18 June, 2018; originally announced June 2018.

Comments: 4 pages +1 page references, To appear in the International Conference on Machine Learning Workshop on Computational Biology, 2018

arXiv:1806.01984 [pdf, other]

Learning to rank for censored survival data

Authors: Margaux Luck, Tristan Sylvain, Joseph Paul Cohen, Heloise Cardinal, Andrea Lodi, Yoshua Bengio

Abstract: Survival analysis is a type of semi-supervised ranking task where the target output (the survival time) is often right-censored. Utilizing this information is a challenge because it is not obvious how to correctly incorporate these censored examples into a model. We study how three categories of loss functions, namely partial likelihood methods, rank methods, and our classification method based on a Wasserstein metric (WM) and the non-parametric Kaplan Meier estimate of the probability density to impute the labels of censored examples, can take advantage of this information. The proposed method allows us to have a model that predicts the probability distribution of an event. If a clinician had access to the detailed probability of an event over time this would help in treatment planning. For example, determining if the risk of kidney graft rejection is constant or peaked after some time. Also, we demonstrate that this approach directly optimizes the expected C-index which is the most common evaluation metric for ranking survival models.

Submitted 8 June, 2018; v1 submitted 5 June, 2018; originally announced June 2018.

arXiv:1802.05035 [pdf, other]

Nonnegative PARAFAC2: a flexible coupling approach

Authors: Jeremy E. Cohen, Rasmus Bro

Abstract: Modeling variability in tensor decomposition methods is one of the challenges of source separation. One possible solution to account for variations from one data set to another, jointly analysed, is to resort to the PARAFAC2 model. However, so far imposing constraints on the mode with variability has not been possible. In the following manuscript, a relaxation of the PARAFAC2 model is introduced, that allows for imposing nonnegativity constraints on the varying mode. An algorithm to compute the proposed flexible PARAFAC2 model is derived, and its performance is studied on both synthetic and chemometrics data.

Submitted 14 February, 2018; originally announced February 2018.

arXiv:1802.03203 [pdf, other]

Curve Registered Coupled Low Rank Factorization

Authors: Jeremy Emile Cohen, Rodrigo Cabral Farias, Bertrand Rivet

Abstract: We propose an extension of the canonical polyadic (CP) tensor model where one of the latent factors is allowed to vary through data slices in a constrained way. The components of the latent factors, which we want to retrieve from data, can vary from one slice to another up to a diffeomorphism. We suppose that the diffeomorphisms are also unknown, thus merging curve registration and tensor decomposition in one model, which we call registered CP. We present an algorithm to retrieve both the latent factors and the diffeomorphism, which is assumed to be in a parametrized form. At the end of the paper, we show simulation results comparing registered CP with other models from the literature.

Submitted 9 February, 2018; originally announced February 2018.

arXiv:1712.04120 [pdf, other]

GibbsNet: Iterative Adversarial Inference for Deep Graphical Models

Authors: Alex Lamb, Devon Hjelm, Yaroslav Ganin, Joseph Paul Cohen, Aaron Courville, Yoshua Bengio

Abstract: Directed latent variable models that formulate the joint distribution as $p(x,z) = p(z) p(x \mid z)$ have the advantage of fast and exact sampling. However, these models have the weakness of needing to specify $p(z)$, often with a simple fixed prior that limits the expressiveness of the model. Undirected latent variable models discard the requirement that $p(z)$ be specified with a prior, yet sampling from them generally requires an iterative procedure such as blocked Gibbs-sampling that may require many steps to draw samples from the joint distribution $p(x, z)$. We propose a novel approach to learning the joint distribution between the data and a latent code which uses an adversarially learned iterative procedure to gradually refine the joint distribution, $p(x, z)$, to better match with the data distribution on each step. GibbsNet is the best of both worlds both in theory and in practice. Achieving the speed and simplicity of a directed latent variable model, it is guaranteed (assuming the adversarial game reaches the virtual training criteria global minimum) to produce samples from $p(x, z)$ with only a few sampling iterations. Achieving the expressiveness and flexibility of an undirected latent variable model, GibbsNet does away with the need for an explicit $p(z)$ and has the ability to do attribute prediction, class-conditional generation, and joint image-attribute modeling in a single model which is not trained for any of these specific tasks. We show empirically that GibbsNet is able to learn a more complex $p(z)$ and show that this leads to improved inpainting and iterative refinement of $p(x, z)$ for dozens of steps and stable generation without collapse for thousands of steps, despite being trained on only a few steps.

Submitted 11 December, 2017; originally announced December 2017.

Comments: NIPS 2017

arXiv:1711.03058 [pdf, other]

Matrix-normal models for fMRI analysis

Authors: Michael Shvartsman, Narayanan Sundaram, Mikio C. Aoi, Adam Charles, Theodore C. Wilke, Jonathan D. Cohen

Abstract: Multivariate analysis of fMRI data has benefited substantially from advances in machine learning. Most recently, a range of probabilistic latent variable models applied to fMRI data have been successful in a variety of tasks, including identifying similarity patterns in neural data (Representational Similarity Analysis and its empirical Bayes variant, RSA and BRSA; Intersubject Functional Connectivity, ISFC), combining multi-subject datasets (Shared Response Mapping; SRM), and mapping between brain and behavior (Joint Modeling). Although these methods share some underpinnings, they have been developed as distinct methods, with distinct algorithms and software tools. We show how the matrix-variate normal (MN) formalism can unify some of these methods into a single framework. In doing so, we gain the ability to reuse noise modeling assumptions, algorithms, and code across models. Our primary theoretical contribution shows how some of these methods can be written as instantiations of the same model, allowing us to generalize them to flexibly model structured noise covariances. Our formalism permits novel model variants and improved estimation strategies: in contrast to SRM, the number of parameters for MN-SRM does not scale with the number of voxels or subjects; in contrast to BRSA, the number of parameters for MN-RSA scales additively rather than multiplicatively in the number of voxels. We empirically demonstrate advantages of two new methods derived in the formalism: for MN-RSA, we show up to 10x improvement in runtime, up to 6x improvement in RMSE, and more conservative behavior under the null. For MN-SRM, our method grants a modest improvement to out-of-sample reconstruction while relaxing an orthonormality constraint of SRM. We also provide a software prototyping tool for MN models that can flexibly reuse noise covariance assumptions and algorithms across models.

Submitted 9 November, 2017; v1 submitted 8 November, 2017; originally announced November 2017.

arXiv:1704.00541 [pdf, other]

doi 10.1109/TSP.2017.2777393

Dictionary-based Tensor Canonical Polyadic Decomposition

Authors: Jérémy E. Cohen, Nicolas Gillis

Abstract: To ensure interpretability of extracted sources in tensor decomposition, we introduce in this paper a dictionary-based tensor canonical polyadic decomposition which enforces one factor to belong exactly to a known dictionary. A new formulation of sparse coding is proposed which enables high dimensional tensors dictionary-based canonical polyadic decomposition. The benefits of using a dictionary in tensor decomposition models are explored both in terms of parameter identifiability and estimation accuracy. Performances of the proposed algorithms are evaluated on the decomposition of simulated data and the unmixing of hyperspectral images.

Submitted 8 November, 2017; v1 submitted 3 April, 2017; originally announced April 2017.

Journal ref: IEEE Trans. on Signal Processing 66 (7), pp. 1876-1889, 2018

arXiv:1703.08710 [pdf, other]

Count-ception: Counting by Fully Convolutional Redundant Counting

Authors: Joseph Paul Cohen, Genevieve Boucher, Craig A. Glastonbury, Henry Z. Lo, Yoshua Bengio

Abstract: Counting objects in digital images is a process that should be replaced by machines. This tedious task is time consuming and prone to errors due to fatigue of human annotators. The goal is to have a system that takes as input an image and returns a count of the objects inside and justification for the prediction in the form of object localization. We repose a problem, originally posed by Lempitsky and Zisserman, to instead predict a count map which contains redundant counts based on the receptive field of a smaller regression network. The regression network predicts a count of the objects that exist inside this frame. By processing the image in a fully convolutional way each pixel is going to be accounted for some number of times, the number of windows which include it, which is the size of each window, (i.e., 32x32 = 1024). To recover the true count we take the average over the redundant predictions. Our contribution is redundant counting instead of predicting a density map in order to average over errors. We also propose a novel deep neural network architecture adapted from the Inception family of networks called the Count-ception network. Together our approach results in a 20% relative improvement (2.9 to 2.3 MAE) over the state of the art method by Xie, Noble, and Zisserman in 2016.

Submitted 23 July, 2017; v1 submitted 25 March, 2017; originally announced March 2017.

Comments: Under Review

arXiv:1702.00261 [pdf, other]

Phenomenological forecasting of disease incidence using heteroskedastic Gaussian processes: a dengue case study

Authors: Leah R. Johnson, Robert B. Gramacy, Jeremy Cohen, Erin Mordecai, Courtney Murdock, Jason Rohr, Sadie J. Ryan, Anna M. Stewart-Ibarra, Daniel Weikel

Abstract: In 2015 the US federal government sponsored a dengue forecasting competition using historical case data from Iquitos, Peru and San Juan, Puerto Rico. Competitors were evaluated on several aspects of out-of-sample forecasts including the targets of peak week, peak incidence during that week and total season incidence across each of several seasons. Our team was one of the top performers of that competition, outperforming all other teams in multiple targets/locales. In this paper we report on our methodology, a large component of which, surprisingly, ignores the known biology of epidemics at large---in particular relationships between dengue transmission and environmental factors---and instead relies on flexible nonparametric nonlinear Gaussian process (GP) regression fits that "memorize" the trajectories of past seasons, and then "match" the dynamics of the unfolding season to past ones in real-time. Our phenomenological approach has advantages in situations where disease dynamics are less well understood, e.g., at sites with shorter histories of disease (such as Iquitos), or where measurements and forecasts of ancillary covariates like precipitation are unavailable and/or where the strength of association with cases is as yet unknown. In particular, we show that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) that we developed to utilize abundant covariate information. We illustrate variations of our method(s) on the two benchmark locales alongside a full summary of results submitted by other contest competitors.

Submitted 1 August, 2017; v1 submitted 1 February, 2017; originally announced February 2017.

Comments: 39 pages, 13 figures, 4 tables, including appendices

Showing 1–46 of 46 results for author: Cohen, J