-
Dataset of artefacts for machine learning applications in astronomy
Authors:
Sreevarsha Sreejith,
Maria V. Pruzhinskaya,
Alina A. Volnova,
Vadim V. Krushinsky,
Konstantin L. Malanchev,
Emille E. O. Ishida,
Anastasia D. Lavrukhina,
Timofey A. Semenikhin,
Emmanuel Gangler,
Matwey V. Kornilov,
Vladimir S. Korolev
Abstract:
Accurate photometry in astronomical surveys is challenged by image artefacts, which affect measurements and degrade data quality. Due to the large amount of available data, artefact identification is increasingly handled using machine learning algorithms, which often require a labelled training set to learn data patterns. We present an expert-labelled dataset of 1127 artefacts with 1213 labels from 26 fields in ZTF DR3, along with a complementary set of nominal objects. The artefact dataset was compiled using the active anomaly detection algorithm PineForest, developed by the SNAD team. These datasets can serve as valuable resources for real-bogus classification, catalogue cleaning, anomaly detection, and educational purposes. Both artefact and nominal images are provided in FITS format in two sizes (28 × 28 and 63 × 63 pixels). The datasets are publicly available for further scientific applications.
Submitted 10 April, 2025;
originally announced April 2025.
-
Coniferest: a complete active anomaly detection framework
Authors:
M. V. Kornilov,
V. S. Korolev,
K. L. Malanchev,
A. D. Lavrukhina,
E. Russeil,
T. A. Semenikhin,
E. Gangler,
E. E. O. Ishida,
M. V. Pruzhinskaya,
A. A. Volnova,
S. Sreejith
Abstract:
We present coniferest, an open-source, general-purpose active anomaly detection framework written in Python. The package design and implemented algorithms are described. Currently, static outlier detection analysis is supported via the Isolation Forest algorithm. Moreover, the Active Anomaly Discovery (AAD) and PineForest algorithms are available to tackle active anomaly detection problems. The algorithms and package performance are evaluated on a series of synthetic datasets. We also describe a few success cases which resulted from applying the package to real astronomical data in active anomaly detection tasks within the SNAD project.
Submitted 15 November, 2024; v1 submitted 22 October, 2024;
originally announced October 2024.
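The static Isolation Forest algorithm named in the abstract above ranks objects by how quickly random splits isolate them. A minimal sketch of the standard Isolation Forest score, s = 2^(-E[h]/c(n)), follows; the function names are hypothetical and this is not coniferest's API, whose score conventions may differ.

```python
def average_path_length(n):
    """c(n): expected depth of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    harmonic = sum(1.0 / k for k in range(1, n))  # H(n - 1), summed exactly
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Score in (0, 1]: close to 1 for easily isolated points, about 0.5 for typical ones."""
    return 2.0 ** (-mean_depth / average_path_length(n))


if __name__ == "__main__":
    n = 256  # subsample size per tree (an assumed, commonly used value)
    for depth in (2.0, average_path_length(n), 20.0):
        print(f"mean depth {depth:6.2f} -> score {anomaly_score(depth, n):.3f}")
```

Points isolated at shallow depth receive scores near 1, while points whose mean depth equals c(n) receive exactly 0.5.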
-
Real-bogus scores for active anomaly detection
Authors:
T. A. Semenikhin,
M. V. Kornilov,
M. V. Pruzhinskaya,
A. D. Lavrukhina,
E. Russeil,
E. Gangler,
E. E. O. Ishida,
V. S. Korolev,
K. L. Malanchev,
A. A. Volnova,
S. Sreejith
Abstract:
In the task of anomaly detection in modern time-domain photometric surveys, the primary goal is to identify astrophysically interesting, rare, and unusual objects among a large volume of data. Unfortunately, artifacts -- such as plane or satellite tracks, bad columns on CCDs, and ghosts -- often constitute significant contaminants in results from anomaly detection analysis. In such contexts, the Active Anomaly Discovery (AAD) algorithm allows tailoring the output of anomaly detection pipelines according to what the expert judges to be scientifically interesting. We demonstrate how the introduction of real-bogus scores, obtained from a machine learning classifier, improves the results from AAD. Using labeled data from the SNAD ZTF knowledge database, we train four real-bogus classifiers: XGBoost, CatBoost, Random Forest, and Extremely Randomized Trees. All the models perform real-bogus classification with similar effectiveness, achieving ROC-AUC scores ranging from 0.93 to 0.95. Consequently, we select the Random Forest model as the main model due to its simplicity and interpretability. The Random Forest classifier is applied to 67 million light curves from ZTF DR17. The output real-bogus score is used as an additional feature for two anomaly detection algorithms: static Isolation Forest and AAD. While results from Isolation Forest remain unchanged, the number of artifacts detected by the active approach decreases significantly with the inclusion of the real-bogus score, from 27 to 3 out of 100. We conclude that incorporating the real-bogus classifier result as an additional feature in the active anomaly detection pipeline significantly reduces the number of artifacts in the outputs, thereby increasing the incidence of astrophysically interesting objects presented to human experts.
Submitted 20 December, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
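The abstract above relies on two simple operations: scoring a classifier by ROC-AUC and appending its output as an extra feature column. The sketch below illustrates both in plain Python. The function names and the toy numbers are hypothetical and are not taken from the paper's pipeline.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (real, bogus) pairs ranked correctly; ties count 0.5."""
    real = [s for y, s in zip(labels, scores) if y == 1]
    bogus = [s for y, s in zip(labels, scores) if y == 0]
    if not real or not bogus:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if r > b else 0.5 if r == b else 0.0 for r in real for b in bogus)
    return wins / (len(real) * len(bogus))


def append_score(features, scores):
    """Return new feature rows with the real-bogus score added as the last column."""
    return [list(row) + [score] for row, score in zip(features, scores)]


if __name__ == "__main__":
    labels = [0, 0, 1, 1]            # 1 = real, 0 = bogus (toy labels)
    scores = [0.1, 0.4, 0.35, 0.8]   # toy classifier outputs
    print("ROC-AUC:", roc_auc(labels, scores))
    print(append_score([[1.2, 0.3], [0.7, 2.1]], [0.9, 0.2]))
```

The augmented rows can then be passed to any anomaly detector in place of the original feature matrix.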
-
SNAD catalogue of M-dwarf flares from the Zwicky Transient Facility
Authors:
A. S. Voloshina,
A. D. Lavrukhina,
M. V. Pruzhinskaya,
K. L. Malanchev,
E. E. O. Ishida,
V. V. Krushinsky,
P. D. Aleo,
E. Gangler,
M. V. Kornilov,
V. S. Korolev,
E. Russeil,
T. A. Semenikhin,
S. Sreejith,
A. A. Volnova
Abstract:
Most of the stars in the Universe are M spectral class dwarfs, which are known to be the source of bright and frequent stellar flares. In this paper, we propose new approaches to discover M-dwarf flares in ground-based photometric surveys. We employ two approaches: a modification of a traditional method of parametric fit search and a machine learning algorithm based on active anomaly detection. The algorithms are applied to Zwicky Transient Facility (ZTF) data release 8, which includes the data from the ZTF high-cadence survey, allowing us to reveal flares lasting from minutes to hours. We analyze over 35 million ZTF light curves and visually scrutinize 1168 candidates suggested by the algorithms to filter out artifacts, occultations of a star by an asteroid, and other types of known variable objects. The result of this analysis is the largest catalogue of ZTF flaring stars to date, representing 134 flares with amplitudes ranging from -0.2 to -4.6 magnitudes, including repeated flares. Using Pan-STARRS DR2 colors, we assign a spectral subclass to each object in the sample. For 13 flares with well-sampled light curves and available geometric distances from Gaia DR3, we estimate the bolometric energy. This research shows that the proposed methods combined with the ZTF's cadence strategy are suitable for identifying M-dwarf flares and other fast transients, allowing for the extraction of significant astrophysical information from their light curves.
Submitted 29 September, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
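The flare amplitudes quoted in the abstract above are magnitude differences, which translate to flux ratios through the standard relation F_peak / F_quiet = 10^(-0.4 Δm). A short sketch with a hypothetical helper name:

```python
def flux_ratio(delta_mag):
    """Flux ratio for a magnitude change; negative delta_mag means brightening."""
    return 10.0 ** (-0.4 * delta_mag)


if __name__ == "__main__":
    # Amplitude range quoted in the abstract: -0.2 to -4.6 magnitudes.
    for delta_mag in (-0.2, -4.6):
        print(f"delta m = {delta_mag:+.1f} mag -> flux ratio {flux_ratio(delta_mag):.2f}")
```

Under this relation, the quoted amplitude range corresponds to flares brightening the star by factors of roughly 1.2 to 69.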
-
Rainbow: a colorful approach on multi-passband light curve estimation
Authors:
E. Russeil,
K. L. Malanchev,
P. D. Aleo,
E. E. O. Ishida,
M. V. Pruzhinskaya,
E. Gangler,
A. D. Lavrukhina,
A. A. Volnova,
A. Voloshina,
T. Semenikhin,
S. Sreejith,
M. V. Kornilov,
V. S. Korolev
Abstract:
We present Rainbow, a physically motivated framework which enables simultaneous multi-band light curve fitting. It allows the user to construct a 2-dimensional continuous surface across wavelength and time, even in situations where the number of observations in each filter is significantly limited. Assuming the electromagnetic radiation emission from the transient can be approximated by a black-body, we combined an expected temperature evolution and a parametric function describing its bolometric light curve. These three ingredients allow the information available in one passband to guide the reconstruction in the others, thus enabling a proper use of multi-survey data. We demonstrate the effectiveness of our method by applying it to simulated data from the Photometric LSST Astronomical Time-series Classification Challenge (PLAsTiCC) as well as real data from the Young Supernova Experiment (YSE DR1). We evaluate the quality of the estimated light curves according to three different tests: goodness of fit, time of peak prediction and ability to transfer information to machine learning (ML) based classifiers. Results confirm that Rainbow leads to equivalent (SN II) or up to 75% better (SN Ibc) goodness of fit when compared to the Monochromatic approach. Similarly, accuracy when using Rainbow best-fit values as a parameter space in multi-class ML classification improves for all classes in our sample. An efficient implementation of Rainbow has been publicly released as part of the light curve package at https://github.com/light-curve/light-curve-python. Our approach enables straightforward light curve estimation for objects with observations in multiple filters and from multiple experiments. It is particularly well suited for situations where light curve sampling is sparse.
Submitted 5 October, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
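The three ingredients named in the Rainbow abstract above (black-body spectrum, temperature evolution, bolometric light curve) can be combined as in the sketch below. It is an illustration under stated assumptions, not the light-curve package implementation: the Bazin-like bolometric form, the logistic temperature law, and all function and parameter names are assumed here for illustration.

```python
import math

H = 6.62607015e-34         # Planck constant, J s
K_B = 1.380649e-23         # Boltzmann constant, J / K
C = 2.99792458e8           # speed of light, m / s
SIGMA_SB = 5.670374419e-8  # Stefan-Boltzmann constant, W / m^2 / K^4


def normalized_planck(nu, temp):
    """pi * B_nu(T) / (sigma * T^4): black-body spectrum with unit integral over nu."""
    b_nu = 2.0 * H * nu**3 / C**2 / math.expm1(H * nu / (K_B * temp))
    return math.pi * b_nu / (SIGMA_SB * temp**4)


def bolometric_flux(t, amplitude, t0, t_rise, t_fall):
    """Bazin-like rise and decline (assumed parametric form)."""
    return amplitude * math.exp(-(t - t0) / t_fall) / (1.0 + math.exp(-(t - t0) / t_rise))


def temperature(t, t0, t_min, t_max, t_color):
    """Logistic cooling from t_max to t_min (assumed temperature evolution)."""
    return t_min + (t_max - t_min) / (1.0 + math.exp((t - t0) / t_color))


def rainbow_flux(t, wavelength_m, p):
    """Spectral flux density at time t and wavelength, from one shared parameter set."""
    temp = temperature(t, p["t0"], p["t_min"], p["t_max"], p["t_color"])
    f_bol = bolometric_flux(t, p["amplitude"], p["t0"], p["t_rise"], p["t_fall"])
    return normalized_planck(C / wavelength_m, temp) * f_bol


if __name__ == "__main__":
    # Hypothetical parameters: times in days, temperatures in kelvin.
    params = {"amplitude": 1.0, "t0": 0.0, "t_rise": 3.0, "t_fall": 25.0,
              "t_min": 5000.0, "t_max": 12000.0, "t_color": 10.0}
    for band, wavelength in (("g", 480e-9), ("r", 620e-9)):
        print(band, [rainbow_flux(t, wavelength, params) for t in (-5.0, 0.0, 20.0)])
```

Because every passband is evaluated from the same parameter set, an observation in one filter constrains the predicted flux in all the others, which is the property the abstract highlights for sparsely sampled light curves.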