Showing 1–7 of 7 results for author: Grinsztajn, L

  1. arXiv:2407.04491  [pdf, other]

    cs.LG

    Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

    Authors: David Holzmüller, Léo Grinsztajn, Ingo Steinwart

    Abstract: For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) strong meta-tuned default parameters for GBDTs and RealMLP. We tune RealMLP a…

    Submitted 15 January, 2025; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: NeurIPS 2024. Changes in v3: mention bug in XGBoost results, mention original name of he+5 method. Code is available at github.com/dholzmueller/pytabkit

  2. arXiv:2402.16785  [pdf, other]

    cs.LG

    CARTE: Pretraining and Transfer for Tabular Learning

    Authors: Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux

    Abstract: Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences, correspondences in the entries (entity matching) where different words may denote the same entity, correspondences across columns (schema matching),…

    Submitted 31 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  3. arXiv:2312.09634  [pdf, other]

    stat.ML cs.LG

    Vectorizing string entries for data processing on tables: when are larger language models better?

    Authors: Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, Gaël Varoquaux

    Abstract: There are increasingly efficient data processing pipelines that work on vectors of numbers, for instance most machine learning models, or vector databases for fast similarity search. These require converting the data to numbers. While this conversion is easy for simple numerical and categorical entries, databases are rife with text entries, such as names or descriptions. In the age of large lang…

    Submitted 15 December, 2023; originally announced December 2023.

  4. arXiv:2211.12312  [pdf, other]

    cs.LG cs.AI

    Interpreting Neural Networks through the Polytope Lens

    Authors: Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

    Abstract: Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: 22/11/22 initial upload

  5. arXiv:2207.08815  [pdf, other]

    cs.LG cs.AI stat.ME stat.ML

    Why do tree-based models still outperform deep learning on tabular data?

    Authors: Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux

    Abstract: While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains wit…

    Submitted 18 July, 2022; originally announced July 2022.

  6. arXiv:2006.02985  [pdf, other]

    stat.CO q-bio.QM stat.AP

    Bayesian workflow for disease transmission modeling in Stan

    Authors: Léo Grinsztajn, Elizaveta Semenova, Charles C. Margossian, Julien Riou

    Abstract: This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the SARS-CoV-2 pandemic and other infectious diseases in a Bayesian framework. Bayesian modeling provides a principled way to quantify uncertainty and incorporate both data and prior knowledge into the model estimates. Stan is an expressive probabilis…

    Submitted 30 September, 2021; v1 submitted 23 May, 2020; originally announced June 2020.

  7. arXiv:2002.12253  [pdf, other]

    stat.ML cs.LG stat.CO

    MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference

    Authors: Achille Thin, Nikita Kotelevskii, Jean-Stanislas Denain, Leo Grinsztajn, Alain Durmus, Maxim Panov, Eric Moulines

    Abstract: In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing Flows. The marginal distribution produced by such…

    Submitted 27 February, 2020; originally announced February 2020.