-
From Data to Diagnosis: A Large, Comprehensive Bone Marrow Dataset and AI Methods for Childhood Leukemia Prediction
Authors:
Henning Höfener,
Farina Kock,
Martina Pontones,
Tabita Ghete,
David Pfrang,
Nicholas Dickel,
Meik Kunz,
Daniela P. Schacherer,
David A. Clunie,
Andrey Fedorov,
Max Westphal,
Markus Metzler
Abstract:
Leukemia diagnosis primarily relies on manual microscopic analysis of bone marrow morphology supported by additional laboratory parameters, making it complex and time consuming. While artificial intelligence (AI) solutions have been proposed, most utilize private datasets and only cover parts of the diagnostic pipeline. Therefore, we present a large, high-quality, publicly available leukemia bone…
▽ More
Leukemia diagnosis primarily relies on manual microscopic analysis of bone marrow morphology supported by additional laboratory parameters, making it complex and time consuming. While artificial intelligence (AI) solutions have been proposed, most utilize private datasets and only cover parts of the diagnostic pipeline. Therefore, we present a large, high-quality, publicly available leukemia bone marrow dataset spanning the entire diagnostic process, from cell detection to diagnosis. Using this dataset, we further propose methods for cell detection, cell classification, and diagnosis prediction. The dataset comprises 246 pediatric patients with diagnostic, clinical and laboratory information, over 40 000 cells with bounding box annotations and more than 28 000 of these with high-quality class labels, making it the most comprehensive dataset publicly available. Evaluation of the AI models yielded an average precision of 0.96 for the cell detection, an area under the curve of 0.98, and an F1-score of 0.61 for the 33-class cell classification, and a mean F1-score of 0.90 for the diagnosis prediction using predicted cell counts. While the proposed approaches demonstrate their usefulness for AI-assisted diagnostics, the dataset will foster further research and development in the field, ultimately contributing to more precise diagnoses and improved patient outcomes.
△ Less
Submitted 19 September, 2025;
originally announced September 2025.
-
Hierarchical Time Series Forecasting Via Latent Mean Encoding
Authors:
Alessandro Salatiello,
Stefan Birr,
Manuel Kunz
Abstract:
Coherently forecasting the behaviour of a target variable across both coarse and fine temporal scales is crucial for profit-optimized decision-making in several business applications, and remains an open research problem in temporal hierarchical forecasting. Here, we propose a new hierarchical architecture that tackles this problem by leveraging modules that specialize in forecasting the different…
▽ More
Coherently forecasting the behaviour of a target variable across both coarse and fine temporal scales is crucial for profit-optimized decision-making in several business applications, and remains an open research problem in temporal hierarchical forecasting. Here, we propose a new hierarchical architecture that tackles this problem by leveraging modules that specialize in forecasting the different temporal aggregation levels of interest. The architecture, which learns to encode the average behaviour of the target variable within its hidden layers, makes accurate and coherent forecasts across the target temporal hierarchies. We validate our architecture on the challenging, real-world M5 dataset and show that it outperforms established methods, such as the TSMixer model.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
Anchors no more: Using peculiar velocities to constrain $H_0$ and the primordial Universe without calibrators
Authors:
Davide Piras,
Francesco Sorrenti,
Ruth Durrer,
Martin Kunz
Abstract:
We develop a novel approach to constrain the Hubble parameter $H_0$ and the primordial power spectrum amplitude $A_\mathrm{s}$ using type Ia supernovae (SNIa) data. By considering SNIa as tracers of the peculiar velocity field, we can model their distance and their covariance as a function of cosmological parameters without the need of calibrators like Cepheids; this yields a new independent probe…
▽ More
We develop a novel approach to constrain the Hubble parameter $H_0$ and the primordial power spectrum amplitude $A_\mathrm{s}$ using type Ia supernovae (SNIa) data. By considering SNIa as tracers of the peculiar velocity field, we can model their distance and their covariance as a function of cosmological parameters without the need of calibrators like Cepheids; this yields a new independent probe of the large-scale structure based on SNIa data without distance anchors. Crucially, we implement a differentiable pipeline in JAX, including efficient emulators and affine sampling, reducing inference time from years to hours on a single GPU. We first validate our method on mock datasets, demonstrating that we can constrain $H_0$ and $\log 10^{10}A_\mathrm{s}$ within $10\%$ and $15\%$, respectively, using $\mathcal{O}(10^3)$ SNIa. We then test our pipeline with SNIa from an $N$-body simulation, obtaining $6\%$-level unbiased constraints on $H_0$ with a moderate noise level. We finally apply our method to Pantheon+ data, constraining $H_0$ at the $15\%$ level without Cepheids when fixing $A_\mathrm{s}$ to its $\it{Planck}$ value. On the other hand, we obtain $20\%$-level constraints on $\log 10^{10}A_\mathrm{s}$ in agreement with $\it{Planck}$ when including Cepheids in the analysis. In light of upcoming observations of low redshift SNIa from the Zwicky Transient Facility and the Vera Rubin Legacy Survey of Space and Time, surveys for which our method will develop its full potential, we make our code publicly available.
△ Less
Submitted 2 September, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
Causal Forecasting for Pricing
Authors:
Douglas Schultz,
Johannes Stephan,
Julian Sieber,
Trudie Yeh,
Manuel Kunz,
Patrick Doupe,
Tim Januschowski
Abstract:
This paper proposes a novel method for demand forecasting in a pricing context. Here, modeling the causal relationship between price as an input variable to demand is crucial because retailers aim to set prices in a (profit) optimal manner in a downstream decision making problem. Our methods bring together the Double Machine Learning methodology for causal inference and state-of-the-art transforme…
▽ More
This paper proposes a novel method for demand forecasting in a pricing context. Here, modeling the causal relationship between price as an input variable to demand is crucial because retailers aim to set prices in a (profit) optimal manner in a downstream decision making problem. Our methods bring together the Double Machine Learning methodology for causal inference and state-of-the-art transformer-based forecasting models. In extensive empirical experiments, we show on the one hand that our method estimates the causal effect better in a fully controlled setting via synthetic, yet realistic data. On the other hand, we demonstrate on real-world data that our method outperforms forecasting methods in off-policy settings (i.e., when there's a change in the pricing policy) while only slightly trailing in the on-policy setting.
△ Less
Submitted 30 January, 2024; v1 submitted 23 December, 2023;
originally announced December 2023.
-
Euclid: Identification of asteroid streaks in simulated images using deep learning
Authors:
M. Pöntinen,
M. Granvik,
A. A. Nucita,
L. Conversi,
B. Altieri,
B. Carry,
C. M. O'Riordan,
D. Scott,
N. Aghanim,
A. Amara,
L. Amendola,
N. Auricchio,
M. Baldi,
D. Bonino,
E. Branchini,
M. Brescia,
S. Camera,
V. Capobianco,
C. Carbone,
J. Carretero,
M. Castellano,
S. Cavuoti,
A. Cimatti,
R. Cledassou,
G. Congedo
, et al. (92 additional authors not shown)
Abstract:
Up to 150000 asteroids will be visible in the images of the ESA Euclid space telescope, and the instruments of Euclid offer multiband visual to near-infrared photometry and slitless spectra of these objects. Most asteroids will appear as streaks in the images. Due to the large number of images and asteroids, automated detection methods are needed. A non-machine-learning approach based on the Strea…
▽ More
Up to 150000 asteroids will be visible in the images of the ESA Euclid space telescope, and the instruments of Euclid offer multiband visual to near-infrared photometry and slitless spectra of these objects. Most asteroids will appear as streaks in the images. Due to the large number of images and asteroids, automated detection methods are needed. A non-machine-learning approach based on the StreakDet software was previously tested, but the results were not optimal for short and/or faint streaks. We set out to improve the capability to detect asteroid streaks in Euclid images by using deep learning.
We built, trained, and tested a three-step machine-learning pipeline with simulated Euclid images. First, a convolutional neural network (CNN) detected streaks and their coordinates in full images, aiming to maximize the completeness (recall) of detections. Then, a recurrent neural network (RNN) merged snippets of long streaks detected in several parts by the CNN. Lastly, gradient-boosted trees (XGBoost) linked detected streaks between different Euclid exposures to reduce the number of false positives and improve the purity (precision) of the sample.
The deep-learning pipeline surpasses the completeness and reaches a similar level of purity of a non-machine-learning pipeline based on the StreakDet software. Additionally, the deep-learning pipeline can detect asteroids 0.25-0.5 magnitudes fainter than StreakDet. The deep-learning pipeline could result in a 50% increase in the number of detected asteroids compared to the StreakDet software. There is still scope for further refinement, particularly in improving the accuracy of streak coordinates and enhancing the completeness of the final stage of the pipeline, which involves linking detections across multiple exposures.
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
Deep Learning based Forecasting: a case study from the online fashion industry
Authors:
Manuel Kunz,
Stefan Birr,
Mones Raslan,
Lei Ma,
Zhen Li,
Adele Gouttes,
Mateusz Koren,
Tofigh Naghibi,
Johannes Stephan,
Mariia Bulycheva,
Matthias Grzeschik,
Armin Kekić,
Michael Narodovitch,
Kashif Rasul,
Julian Sieber,
Tim Januschowski
Abstract:
Demand forecasting in the online fashion industry is particularly amendable to global, data-driven forecasting models because of the industry's set of particular challenges. These include the volume of data, the irregularity, the high amount of turn-over in the catalog and the fixed inventory assumption. While standard deep learning forecasting approaches cater for many of these, the fixed invento…
▽ More
Demand forecasting in the online fashion industry is particularly amendable to global, data-driven forecasting models because of the industry's set of particular challenges. These include the volume of data, the irregularity, the high amount of turn-over in the catalog and the fixed inventory assumption. While standard deep learning forecasting approaches cater for many of these, the fixed inventory assumption requires a special treatment via controlling the relationship between price and demand closely. In this case study, we describe the data and our modelling approach for this forecasting problem in detail and present empirical results that highlight the effectiveness of our approach.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
A Priori Calibration of Transient Kinetics Data via Machine Learning
Authors:
M. Ross Kunz,
Adam Yonge,
Rakesh Batchu,
Zongtang Fang,
Yixiao Wang,
Gregory Yablonsky,
Andrew J. Medford,
Rebecca Fushimi
Abstract:
The temporal analysis of products reactor provides a vast amount of transient kinetic information that may be used to describe a variety of chemical features including the residence time distribution, kinetic coefficients, number of active sites, and the reaction mechanism. However, as with any measurement device, the TAP reactor signal is convoluted with noise. To reduce the uncertainty of the ki…
▽ More
The temporal analysis of products reactor provides a vast amount of transient kinetic information that may be used to describe a variety of chemical features including the residence time distribution, kinetic coefficients, number of active sites, and the reaction mechanism. However, as with any measurement device, the TAP reactor signal is convoluted with noise. To reduce the uncertainty of the kinetic measurement and any derived parameters or mechanisms, proper preprocessing must be performed prior to any advanced analysis. This preprocessing consists of baseline correction, i.e., a shift in the voltage response, and calibration, i.e., a scaling of the flux response based on prior experiments. The current methodology of preprocessing requires significant user discretion and reliance on previous experiments that may drift over time. Herein we use machine learning techniques combined with physical constraints to convert the raw instrument signal to chemical information. As such, the proposed methodology demonstrates clear benefits over the traditional preprocessing in the calibration of the inert and feed mixture products without need of prior calibration experiments or heuristic input from the user.
△ Less
Submitted 27 September, 2021;
originally announced September 2021.
-
TAPsolver: A Python package for the simulation and analysis of TAP reactor experiments
Authors:
Adam Yonge,
M. Ross Kunz,
Rakesh Batchu,
Zongtang Fang,
Tobin Issac,
Rebecca Fushimi,
Andrew J. Medford
Abstract:
An open-source, Python-based Temporal Analysis of Products (TAP) reactor simulation and processing program is introduced. TAPsolver utilizes algorithmic differentiation for the calculation of highly accurate derivatives, which are used to perform sensitivity analyses and PDE-constrained optimization. The tool supports constraints to ensure thermodynamic consistency, which can lead to more accurate…
▽ More
An open-source, Python-based Temporal Analysis of Products (TAP) reactor simulation and processing program is introduced. TAPsolver utilizes algorithmic differentiation for the calculation of highly accurate derivatives, which are used to perform sensitivity analyses and PDE-constrained optimization. The tool supports constraints to ensure thermodynamic consistency, which can lead to more accurate parameters and assist in mechanism discrimination. The mathematical and structural details of TAPsolver are outlined, as well as validation of the forward and inverse problems against well-studied prototype problems. Benchmarks of the code are presented, and a case study for extracting thermodynamically-consistent kinetic parameters from experimental TAP measurements of CO oxidation on supported platinum particles is presented. TAPsolver will act as a foundation for future development and dissemination of TAP data processing techniques.
△ Less
Submitted 26 August, 2020;
originally announced August 2020.
-
A Flexible Framework for Anomaly Detection via Dimensionality Reduction
Authors:
Alireza Vafaei Sadr,
Bruce A. Bassett,
Martin Kunz
Abstract:
Anomaly detection is challenging, especially for large datasets in high dimensions. Here we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. We release DRAMA, a general python package that implements the general framework with a wide range of built-in options. We test DRAMA on a wide variety of simulated and real datasets, in up to 3000 d…
▽ More
Anomaly detection is challenging, especially for large datasets in high dimensions. Here we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. We release DRAMA, a general python package that implements the general framework with a wide range of built-in options. We test DRAMA on a wide variety of simulated and real datasets, in up to 3000 dimensions, and find it robust and highly competitive with commonly-used anomaly detection algorithms, especially in high dimensions. The flexibility of the DRAMA framework allows for significant optimization once some examples of anomalies are available, making it ideal for online anomaly detection, active learning and highly unbalanced datasets.
△ Less
Submitted 9 September, 2019;
originally announced September 2019.