Showing 1–2 of 2 results for author: Gsteiger, V

  1. arXiv:2502.19790

    cs.LG cs.AI cs.DB

    Mixtera: A Data Plane for Foundation Model Training

    Authors: Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

    Abstract: State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model…

    Submitted 3 April, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

    Comments: under submission

  2. arXiv:2312.06254

    cs.LG cs.AI cs.DB cs.DC stat.ML

    Modyn: Data-Centric Machine Learning Pipeline Orchestration

    Authors: Maximilian Böther, Ties Robroek, Viktor Gsteiger, Robin Holzinger, Xianzhe Ma, Pınar Tözün, Ana Klimovic

    Abstract: In real-world machine learning (ML) pipelines, datasets are continuously growing. Models must incorporate this new training data to improve generalization and adapt to potential distribution shifts. The cost of model retraining is proportional to how frequently the model is retrained and how much data it is trained on, which makes the naive approach of retraining from scratch each time impractical…

    Submitted 24 January, 2025; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: final version published at SIGMOD '25; 30 pages