-
SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models
Authors:
Alvise Dei Rossi,
Matteo Metaldi,
Michal Bechny,
Irina Filchenko,
Julia van der Meer,
Markus H. Schmidt,
Claudio L. A. Bassetti,
Athina Tzovara,
Francesca D. Faraci,
Luigi Fiorillo
Abstract:
Despite advances in deep learning for automatic sleep staging, clinical adoption remains limited due to challenges in fair model evaluation, generalization across diverse datasets, model bias, and variability in human annotations. We present SLEEPYLAND, an open-source sleep staging evaluation framework designed to address these barriers. It includes more than 220'000 hours in-domain (ID) sleep rec…
▽ More
Despite advances in deep learning for automatic sleep staging, clinical adoption remains limited due to challenges in fair model evaluation, generalization across diverse datasets, model bias, and variability in human annotations. We present SLEEPYLAND, an open-source sleep staging evaluation framework designed to address these barriers. It includes more than 220'000 hours in-domain (ID) sleep recordings, and more than 84'000 hours out-of-domain (OOD) sleep recordings, spanning a broad range of ages, sleep-wake disorders, and hardware setups. We release pre-trained models based on high-performing SoA architectures and evaluate them under standardized conditions across single- and multi-channel EEG/EOG configurations. We introduce SOMNUS, an ensemble combining models across architectures and channel setups via soft voting. SOMNUS achieves robust performance across twenty-four different datasets, with macro-F1 scores between 68.7% and 87.2%, outperforming individual models in 94.9% of cases. Notably, SOMNUS surpasses previous SoA methods, even including cases where compared models were trained ID while SOMNUS treated the same data as OOD. Using a subset of the BSWR (N=6'633), we quantify model biases linked to age, gender, AHI, and PLMI, showing that while ensemble improves robustness, no model architecture consistently minimizes bias in performance and clinical markers estimation. In evaluations on OOD multi-annotated datasets (DOD-H, DOD-O), SOMNUS exceeds the best human scorer, i.e., MF1 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O, better reproducing the scorer consensus than any individual expert (k = 0.89/0.85 and ACS = 0.95/0.94 for healthy/OSA cohorts). Finally, we introduce ensemble disagreement metrics - entropy and inter-model divergence based - predicting regions of scorer disagreement with ROC AUCs up to 0.828, offering a data-driven proxy for human uncertainty.
△ Less
Submitted 11 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Bridging AI and Clinical Practice: Integrating Automated Sleep Scoring Algorithm with Uncertainty-Guided Physician Review
Authors:
Michal Bechny,
Giuliana Monachino,
Luigi Fiorillo,
Julia van der Meer,
Markus H. Schmidt,
Claudio L. A. Bassetti,
Athina Tzovara,
Francesca D. Faraci
Abstract:
Purpose: This study aims to enhance the clinical use of automated sleep-scoring algorithms by incorporating an uncertainty estimation approach to efficiently assist clinicians in the manual review of predicted hypnograms, a necessity due to the notable inter-scorer variability inherent in polysomnography (PSG) databases. Our efforts target the extent of review required to achieve predefined agreem…
▽ More
Purpose: This study aims to enhance the clinical use of automated sleep-scoring algorithms by incorporating an uncertainty estimation approach to efficiently assist clinicians in the manual review of predicted hypnograms, a necessity due to the notable inter-scorer variability inherent in polysomnography (PSG) databases. Our efforts target the extent of review required to achieve predefined agreement levels, examining both in-domain and out-of-domain data, and considering subjects diagnoses. Patients and methods: Total of 19578 PSGs from 13 open-access databases were used to train U-Sleep, a state-of-the-art sleep-scoring algorithm. We leveraged a comprehensive clinical database of additional 8832 PSGs, covering a full spectrum of ages and sleep-disorders, to refine the U-Sleep, and to evaluate different uncertainty-quantification approaches, including our novel confidence network. The ID data consisted of PSGs scored by over 50 physicians, and the two OOD sets comprised recordings each scored by a unique senior physician. Results: U-Sleep demonstrated robust performance, with Cohen's kappa (K) at 76.2% on ID and 73.8-78.8% on OOD data. The confidence network excelled at identifying uncertain predictions, achieving AUROC scores of 85.7% on ID and 82.5-85.6% on OOD data. Independently of sleep-disorder status, statistical evaluations revealed significant differences in confidence scores between aligning vs discording predictions, and significant correlations of confidence scores with classification performance metrics. To achieve K of at least 90% with physician intervention, examining less than 29.0% of uncertain epochs was required, substantially reducing physicians workload, and facilitating near-perfect agreement.
△ Less
Submitted 22 December, 2023;
originally announced December 2023.
-
Towards socially-competent and culturally-adaptive artificial agents Expressive order, interactional disruptions and recovery strategies
Authors:
Chiara Bassetti,
Enrico Blanzieri,
Stefano Borgo,
Sofia Marangon
Abstract:
The development of artificial agents for social interaction pushes to enrich robots with social skills and knowledge about (local) social norms. One possibility is to distinguish the expressive and the functional orders during a human-robot interaction. The overarching aim of this work is to set a framework to make the artificial agent socially-competent beyond dyadic interaction-interaction in va…
▽ More
The development of artificial agents for social interaction pushes to enrich robots with social skills and knowledge about (local) social norms. One possibility is to distinguish the expressive and the functional orders during a human-robot interaction. The overarching aim of this work is to set a framework to make the artificial agent socially-competent beyond dyadic interaction-interaction in varying multi-party social situations-and beyond individual-based user personalization, thereby enlarging the current conception of "culturally-adaptive". The core idea is to provide the artificial agent with the capability to handle different kinds of interactional disruptions, and associated recovery strategies, in microsociology. The result is obtained by classifying functional and social disruptions, and by investigating the requirements a robot's architecture should satisfy to exploit such knowledge. The paper also highlights how this level of competence is achieved by focusing on just three dimensions: (i) social capability, (ii) relational role, and (iii) proximity, leaving aside the further complexity of full-fledged human-human interactions. Without going into technical aspects, End-to-end Data-driven Architectures and Modular Architectures are discussed to evaluate the degree to which they can exploit this new set of social and cultural knowledge. Finally, a list of general requirements for such agents is proposed.
△ Less
Submitted 6 August, 2023;
originally announced August 2023.
-
U-Sleep's resilience to AASM guidelines
Authors:
Luigi Fiorillo,
Giuliana Monachino,
Julia van der Meer,
Marco Pesce,
Jan D. Warncke,
Markus H. Schmidt,
Claudio L. A. Bassetti,
Athina Tzovara,
Paolo Favaro,
Francesca D. Faraci
Abstract:
AASM guidelines are the result of decades of efforts aiming at standardizing sleep scoring procedure, with the final goal of sharing a worldwide common methodology. The guidelines cover several aspects from the technical/digital specifications,e.g., recommended EEG derivations, to detailed sleep scoring rules accordingly to age. Automated sleep scoring systems have always largely exploited the sta…
▽ More
AASM guidelines are the result of decades of efforts aiming at standardizing sleep scoring procedure, with the final goal of sharing a worldwide common methodology. The guidelines cover several aspects from the technical/digital specifications,e.g., recommended EEG derivations, to detailed sleep scoring rules accordingly to age. Automated sleep scoring systems have always largely exploited the standards as fundamental guidelines. In this context, deep learning has demonstrated better performance compared to classical machine learning. Our present work shows that a deep learning based sleep scoring algorithm may not need to fully exploit the clinical knowledge or to strictly adhere to the AASM guidelines. Specifically, we demonstrate that U-Sleep, a state-of-the-art sleep scoring algorithm, can be strong enough to solve the scoring task even using clinically non-recommended or non-conventional derivations, and with no need to exploit information about the chronological age of the subjects. We finally strengthen a well-known finding that using data from multiple data centers always results in a better performing model compared with training on a single cohort. Indeed, we show that this latter statement is still valid even by increasing the size and the heterogeneity of the single data cohort. In all our experiments we used 28528 polysomnography studies from 13 different clinical studies.
△ Less
Submitted 13 March, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
A hybrid machine learning/deep learning COVID-19 severity predictive model from CT images and clinical data
Authors:
Matteo Chieregato,
Fabio Frangiamore,
Mauro Morassi,
Claudia Baresi,
Stefania Nici,
Chiara Bassetti,
Claudio Bnà ,
Marco Galelli
Abstract:
COVID-19 clinical presentation and prognosis are highly variable, ranging from asymptomatic and paucisymptomatic cases to acute respiratory distress syndrome and multi-organ involvement. We developed a hybrid machine learning/deep learning model to classify patients in two outcome categories, non-ICU and ICU (intensive care admission or death), using 558 patients admitted in a northern Italy hospi…
▽ More
COVID-19 clinical presentation and prognosis are highly variable, ranging from asymptomatic and paucisymptomatic cases to acute respiratory distress syndrome and multi-organ involvement. We developed a hybrid machine learning/deep learning model to classify patients in two outcome categories, non-ICU and ICU (intensive care admission or death), using 558 patients admitted in a northern Italy hospital in February/May of 2020. A fully 3D patient-level CNN classifier on baseline CT images is used as feature extractor. Features extracted, alongside with laboratory and clinical data, are fed for selection in a Boruta algorithm with SHAP game theoretical values. A classifier is built on the reduced feature space using CatBoost gradient boosting algorithm and reaching a probabilistic AUC of 0.949 on holdout test set. The model aims to provide clinical decision support to medical doctors, with the probability score of belonging to an outcome class and with case-based SHAP interpretation of features importance.
△ Less
Submitted 13 May, 2021;
originally announced May 2021.
-
F-formation Detection: Individuating Free-standing Conversational Groups in Images
Authors:
Francesco Setti,
Chris Russell,
Chiara Bassetti,
Marco Cristani
Abstract:
Detection of groups of interacting people is a very interesting and useful task in many modern technologies, with application fields spanning from video-surveillance to social robotics. In this paper we first furnish a rigorous definition of group considering the background of the social sciences: this allows us to specify many kinds of group, so far neglected in the Computer Vision literature. On…
▽ More
Detection of groups of interacting people is a very interesting and useful task in many modern technologies, with application fields spanning from video-surveillance to social robotics. In this paper we first furnish a rigorous definition of group considering the background of the social sciences: this allows us to specify many kinds of group, so far neglected in the Computer Vision literature. On top of this taxonomy, we present a detailed state of the art on the group detection algorithms. Then, as a main contribution, we present a brand new method for the automatic detection of groups in still images, which is based on a graph-cuts framework for clustering individuals; in particular we are able to codify in a computational sense the sociological definition of F-formation, that is very useful to encode a group having only proxemic information: position and orientation of people. We call the proposed method Graph-Cuts for F-formation (GCFF). We show how GCFF definitely outperforms all the state of the art methods in terms of different accuracy measures (some of them are brand new), demonstrating also a strong robustness to noise and versatility in recognizing groups of various cardinality.
△ Less
Submitted 9 September, 2014;
originally announced September 2014.