-
GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset
Authors:
Sahar Nasirihaghighi,
Negin Ghamsarian,
Leonie Peschek,
Matteo Munari,
Heinrich Husslein,
Raphael Sznitman,
Klaus Schoeffmann
Abstract:
Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset's quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations.
Submitted 12 June, 2025;
originally announced June 2025.
-
WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos
Authors:
Negin Ghamsarian,
Raphael Sznitman,
Klaus Schoeffmann,
Jens Kowal
Abstract:
To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available on Synapse: https://www.synapse.org/Synapse:syn66401174/files.
Submitted 10 June, 2025;
originally announced June 2025.
-
The State-of-the-Art in Lifelog Retrieval: A Review of Progress at the ACM Lifelog Search Challenge Workshop 2022-24
Authors:
Allie Tran,
Werner Bailer,
Duc-Tien Dang-Nguyen,
Graham Healy,
Steve Hodges,
Björn Þór Jónsson,
Luca Rossetto,
Klaus Schoeffmann,
Minh-Triet Tran,
Lucia Vadicamo,
Cathal Gurrin
Abstract:
The ACM Lifelog Search Challenge (LSC) is a venue that welcomes and compares systems that support the exploration of lifelog data, and in particular the retrieval of specific information, through an interactive competition format. This paper reviews the recent advances in interactive lifelog retrieval as demonstrated at the ACM LSC from 2022 to 2024. Through a detailed comparative analysis, we highlight key improvements across three main retrieval tasks: known-item search, question answering, and ad-hoc search. Our analysis identifies trends such as the widespread adoption of embedding-based retrieval methods (e.g., CLIP, BLIP), increased integration of large language models (LLMs) for conversational retrieval, and continued innovation in multimodal and collaborative search interfaces. We further discuss how specific retrieval techniques and user interface (UI) designs have impacted system performance, emphasizing the importance of balancing retrieval complexity with usability. Our findings indicate that embedding-driven approaches combined with LLMs show promise for lifelog retrieval systems. Likewise, improving UI design can enhance usability and efficiency. Additionally, we recommend reconsidering multi-instance system evaluations within the expert track to better manage variability in user familiarity and configuration effectiveness.
Submitted 7 June, 2025;
originally announced June 2025.
-
Feedback-Driven Pseudo-Label Reliability Assessment: Redefining Thresholding for Semi-Supervised Semantic Segmentation
Authors:
Negin Ghamsarian,
Sahar Nasirihaghighi,
Klaus Schoeffmann,
Raphael Sznitman
Abstract:
Semi-supervised learning leverages unlabeled data to enhance model performance, addressing the limitations of fully supervised approaches. Among its strategies, pseudo-supervision has proven highly effective, typically relying on one or multiple teacher networks to refine pseudo-labels before training a student network. A common practice in pseudo-supervision is filtering pseudo-labels based on pre-defined confidence thresholds or entropy. However, selecting optimal thresholds requires large labeled datasets, which are often scarce in real-world semi-supervised scenarios. To overcome this challenge, we propose Ensemble-of-Confidence Reinforcement (ENCORE), a dynamic feedback-driven thresholding strategy for pseudo-label selection. Instead of relying on static confidence thresholds, ENCORE estimates class-wise true-positive confidence within the unlabeled dataset and continuously adjusts thresholds based on the model's response to different levels of pseudo-label filtering. This feedback-driven mechanism ensures the retention of informative pseudo-labels while filtering unreliable ones, enhancing model training without manual threshold tuning. Our method seamlessly integrates into existing pseudo-supervision frameworks and significantly improves segmentation performance, particularly in data-scarce conditions. Extensive experiments demonstrate that integrating ENCORE with existing pseudo-supervision frameworks enhances performance across multiple datasets and network architectures, validating its effectiveness in semi-supervised learning.
Submitted 12 May, 2025;
originally announced May 2025.
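The static filtering that the abstract above calls common practice can be illustrated with a minimal sketch. This is not the ENCORE method; it only shows fixed class-wise confidence thresholding, the baseline that ENCORE is described as replacing. The function name `filter_pseudo_labels`, the probability values, and the threshold values are all hypothetical.

```python
def filter_pseudo_labels(probs, thresholds):
    """Keep a pseudo-label only if its confidence meets the class threshold.

    probs: list of per-sample class-probability lists (teacher outputs).
    thresholds: one fixed confidence threshold per class (hypothetical values).
    Returns the predicted class index per sample, or None if filtered out.
    """
    labels = []
    for p in probs:
        # Predicted class is the one with the highest probability.
        cls = max(range(len(p)), key=p.__getitem__)
        labels.append(cls if p[cls] >= thresholds[cls] else None)
    return labels


# Three samples, two classes; class 0 needs 0.9 confidence, class 1 needs 0.7.
probs = [[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]]
print(filter_pseudo_labels(probs, [0.9, 0.7]))  # [0, None, 1]
```

According to the abstract, ENCORE adjusts such thresholds dynamically from the model's response to different filtering levels; here they stay fixed.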
-
The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Authors:
Luca Rossetto,
Werner Bailer,
Duc-Tien Dang-Nguyen,
Graham Healy,
Björn Þór Jónsson,
Onanong Kongmeesub,
Hoang-Bao Le,
Stevan Rudinac,
Klaus Schöffmann,
Florian Spiess,
Allie Tran,
Minh-Triet Tran,
Quang-Linh Tran,
Cathal Gurrin
Abstract:
Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.
Submitted 21 March, 2025;
originally announced March 2025.
-
Results of the 2024 Video Browser Showdown
Authors:
Luca Rossetto,
Klaus Schoeffmann,
Cathal Gurrin,
Jakub Lokoč,
Werner Bailer
Abstract:
This report presents the results of the 13th Video Browser Showdown, held at the 2024 International Conference on Multimedia Modeling on the 29th of January 2024 in Amsterdam, the Netherlands.
Submitted 13 December, 2024;
originally announced February 2025.
-
Dual Invariance Self-training for Reliable Semi-supervised Surgical Phase Recognition
Authors:
Sahar Nasirihaghighi,
Negin Ghamsarian,
Raphael Sznitman,
Klaus Schoeffmann
Abstract:
Accurate surgical phase recognition is crucial for advancing computer-assisted interventions, yet the scarcity of labeled data hinders training reliable deep learning models. Semi-supervised learning (SSL), particularly with pseudo-labeling, shows promise over fully supervised methods but often lacks reliable pseudo-label assessment mechanisms. To address this gap, we propose a novel SSL framework, Dual Invariance Self-Training (DIST), that incorporates both Temporal and Transformation Invariance to enhance surgical phase recognition. Our two-step self-training process dynamically selects reliable pseudo-labels, ensuring robust pseudo-supervision. Our approach mitigates the risk of noisy pseudo-labels, steering decision boundaries toward true data distribution and improving generalization to unseen data. Evaluations on Cataract and Cholec80 datasets show our method outperforms state-of-the-art SSL approaches, consistently surpassing both supervised and SSL baselines across various network architectures.
Submitted 29 January, 2025;
originally announced January 2025.
-
SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge
Authors:
Hao Ding,
Yuqian Zhang,
Tuxun Lu,
Ruixing Liang,
Hongchao Shu,
Lalithkumar Seenivasan,
Yonghao Long,
Qi Dou,
Cong Gao,
Yicheng Leng,
Seok Bong Yoo,
Eung-Joo Lee,
Negin Ghamsarian,
Klaus Schoeffmann,
Raphael Sznitman,
Zijian Wu,
Yuxin Chen,
Septimiu E. Salcudean,
Samra Irshad,
Shadi Albarqouni,
Seong Tae Kim,
Yueyi Sun,
An Wang,
Long Bai,
Hongliang Ren
, et al. (17 additional authors not shown)
Abstract:
Surgical data science has seen rapid advancement due to the excellent performance of end-to-end deep neural networks (DNNs) for surgical video analysis. Despite their successes, end-to-end DNNs have been proven susceptible to even minor corruptions, which substantially impair model performance. This vulnerability has become a major concern for the translation of cutting-edge technology, especially for high-stakes decision-making in surgical data science. We introduce SegSTRONG-C, a dedicated benchmark and challenge in surgical data science, aiming to better understand model deterioration under unforeseen but plausible non-adversarial corruption and the capabilities of contemporary methods that seek to improve it. Through comprehensive baseline experiments and participating submissions from widespread community engagement, SegSTRONG-C reveals key themes for model failure and identifies promising directions for improving robustness. The challenge winners achieved an average of 0.9394 DSC and 0.9301 NSD across the unreleased test sets with the corruption types bleeding, smoke, and low brightness, an inspiring improvement of 0.1471 DSC and 0.2584 NSD on average compared to the strongest baseline methods with UNet architecture trained with AutoAugment. In conclusion, the SegSTRONG-C challenge has identified some practical approaches for enhancing model robustness, yet most approaches relied on conventional techniques that have known, and sometimes quite severe, limitations. Looking ahead, we advocate for expanding intellectual diversity and creativity in non-adversarial robustness beyond data augmentation or training scale, calling for new paradigms that enhance universal robustness to corruptions and may enable richer applications in surgical data science.
Submitted 7 April, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
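For readers unfamiliar with the DSC figures quoted above, the following is a minimal sketch of the Dice similarity coefficient on flattened binary masks. The helper name `dice` and the toy masks are hypothetical and are not taken from the challenge code. The last two lines read the reported improvements as absolute differences, which is an assumption, to recover the implied baseline scores.

```python
def dice(pred, target):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: prediction marks two pixels, ground truth marks one of them.
print(round(dice([1, 1, 0, 0], [1, 0, 0, 0]), 4))  # 0.6667

# Implied baseline scores, assuming the reported gains are absolute differences.
print(round(0.9394 - 0.1471, 4))  # DSC baseline: 0.7923
print(round(0.9301 - 0.2584, 4))  # NSD baseline: 0.6717
```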
-
Cognitive Effort Measures Driven by Fixation Induced Retinal Flow in Visual Scanning Behavior during Virtual Driving
Authors:
Runlin Zhang,
Qing Xu,
Simon Parkinson,
Klaus Schoeffmann,
Yu Chen
Abstract:
In this paper, we consider the visual scanning mechanism underpinning sensorimotor tasks, such as walking and driving, in dynamic environments. We exploit eye tracking data to offer two new cognitive effort measures for visual scanning behavior in virtual driving. By utilizing the retinal flow induced by fixation, two novel measures of cognitive effort are proposed through the importance of grids in the viewing plane and the concept of information quantity, respectively. Psychophysical studies are conducted to reveal the effectiveness of the two proposed measures. Both cognitive effort measures show a significant correlation with pupil size change. Our results suggest that the quantitative exploitation of eye tracking data provides an effective approach for the evaluation of sensorimotor activities.
Submitted 15 May, 2024;
originally announced May 2024.
-
Optimal Quality and Efficiency in Adaptive Live Streaming with JND-Aware Low latency Encoding
Authors:
Vignesh V Menon,
Jingwen Zhu,
Prajit T Rajendran,
Samira Afzal,
Klaus Schoeffmann,
Patrick Le Callet,
Christian Timmerer
Abstract:
In HTTP adaptive live streaming applications, video segments are encoded at a fixed set of bitrate-resolution pairs known as a bitrate ladder. Live encoders use the fastest available encoding configuration, referred to as a preset, to ensure the minimum possible latency in video encoding. However, an optimized preset and optimized number of CPU threads for each encoding instance may result in (i) increased quality and (ii) efficient CPU utilization while encoding. For low-latency live encoders, the encoding speed is expected to be greater than or equal to the video framerate. To this end, this paper introduces a Just Noticeable Difference (JND)-Aware Low latency Encoding Scheme (JALE), which uses random forest-based models to jointly determine the optimized encoder preset and thread count for each representation, based on video complexity features, the target encoding speed, the total number of available CPU threads, and the target encoder. Experimental results show that, on average, JALE yields a quality improvement of 1.32 dB PSNR and 5.38 VMAF points at the same bitrate, compared to the fastest preset encoding of the HTTP Live Streaming (HLS) bitrate ladder using the x265 HEVC open-source encoder with eight CPU threads used for each representation. These enhancements are achieved while maintaining the desired encoding speed. Furthermore, on average, JALE results in an overall storage reduction of 72.70%, a reduction in the total number of CPU threads used by 63.83%, and a 37.87% reduction in the overall encoding time, considering a JND of six VMAF points.
Submitted 27 January, 2024;
originally announced January 2024.
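One plausible reading of "considering a JND of six VMAF points" is that representations whose quality is perceptually indistinguishable from an already kept one are dropped from the ladder. The sketch below illustrates that idea only. It is an assumption, not the JALE algorithm, and the function name `prune_ladder` and the bitrate/VMAF pairs are hypothetical.

```python
def prune_ladder(ladder, jnd=6.0):
    """Drop representations less than one JND above the last kept one.

    ladder: list of (bitrate_kbps, vmaf) tuples sorted by ascending bitrate.
    jnd: just noticeable difference in VMAF points (six in the abstract).
    """
    kept = []
    for bitrate, vmaf in ladder:
        # Keep the lowest rung, then only rungs with a noticeable quality gain.
        if not kept or vmaf - kept[-1][1] >= jnd:
            kept.append((bitrate, vmaf))
    return kept


# Hypothetical ladder of (bitrate in kbps, predicted VMAF) pairs.
ladder = [(145, 40), (300, 44), (600, 52), (900, 55), (1600, 60), (2400, 62)]
print(prune_ladder(ladder))  # [(145, 40), (600, 52), (1600, 60)]
```

Storing three representations instead of six is the kind of saving the storage-reduction figure refers to, under this reading.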
-
Cataract-1K: Cataract Surgery Dataset for Scene Segmentation, Phase Recognition, and Irregularity Detection
Authors:
Negin Ghamsarian,
Yosuf El-Shabrawi,
Sahar Nasirihaghighi,
Doris Putzgruber-Adamitsch,
Martin Zinkernagel,
Sebastian Wolf,
Klaus Schoeffmann,
Raphael Sznitman
Abstract:
In recent years, the landscape of computer-assisted interventions and post-operative surgical video analysis has been dramatically reshaped by deep-learning techniques, resulting in significant advancements in surgeons' skills, operation room management, and overall surgical outcomes. However, the progression of deep-learning-powered surgical technologies is profoundly reliant on large-scale datasets and annotations. Particularly, surgical scene understanding and phase recognition stand as pivotal pillars within the realm of computer-assisted surgery and post-operative assessment of cataract surgery videos. In this context, we present the largest cataract surgery video dataset that addresses diverse requisites for constructing computerized surgical workflow analysis and detecting post-operative irregularities in cataract surgery. We validate the quality of annotations by benchmarking the performance of several state-of-the-art neural network architectures for phase recognition and surgical scene segmentation. Besides, we initiate the research on domain adaptation for instrument segmentation in cataract surgery by evaluating cross-domain instrument segmentation performance in cataract surgery videos. The dataset and annotations will be publicly available upon acceptance of the paper.
Submitted 11 December, 2023;
originally announced December 2023.
-
DeepPyramid+: Medical Image Segmentation using Pyramid View Fusion and Deformable Pyramid Reception
Authors:
Negin Ghamsarian,
Sebastian Wolf,
Martin Zinkernagel,
Klaus Schoeffmann,
Raphael Sznitman
Abstract:
Semantic segmentation plays a pivotal role in many applications related to medical image and video analysis. However, designing a neural network architecture for medical image and surgical video segmentation is challenging due to the diverse features of relevant classes, including heterogeneity, deformability, transparency, blunt boundaries, and various distortions. We propose a network architecture, DeepPyramid+, which addresses diverse challenges encountered in medical image and surgical video segmentation. The proposed DeepPyramid+ incorporates two major modules, namely "Pyramid View Fusion" (PVF) and "Deformable Pyramid Reception" (DPR), to address the outlined challenges. PVF replicates a deduction process within the neural network, aligning with the human visual system, thereby enhancing the representation of relative information at each pixel position. Complementarily, DPR introduces shape- and scale-adaptive feature extraction techniques using dilated deformable convolutions, enhancing accuracy and robustness in handling heterogeneous classes and deformable shapes. Extensive experiments conducted on diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, demonstrate the effectiveness of DeepPyramid+ in handling various challenges such as shape and scale variation, reflection, and blur degradation. DeepPyramid+ demonstrates significant improvements in segmentation performance, achieving up to a 3.65% increase in Dice coefficient for intra-domain segmentation and up to a 17% increase in Dice coefficient for cross-domain segmentation. DeepPyramid+ consistently outperforms state-of-the-art networks across diverse modalities considering different backbone networks, showcasing its versatility.
Submitted 6 December, 2023;
originally announced December 2023.
-
Predicting Postoperative Intraocular Lens Dislocation in Cataract Surgery via Deep Learning
Authors:
Negin Ghamsarian,
Doris Putzgruber-Adamitsch,
Stephanie Sarny,
Raphael Sznitman,
Klaus Schoeffmann,
Yosuf El-Shabrawi
Abstract:
A critical yet unpredictable complication following cataract surgery is intraocular lens dislocation. Postoperative stability is imperative, as even a tiny decentration of multifocal lenses or inadequate alignment of the torus in toric lenses due to postoperative rotation can lead to a significant drop in visual acuity. Investigating possible intraoperative indicators that can predict post-surgical instabilities of intraocular lenses can help prevent this complication. In this paper, we develop and evaluate the first fully automatic framework for the computation of lens unfolding delay, rotation, and instability during surgery. Adopting a combination of three types of CNNs, namely recurrent, region-based, and pixel-based, the proposed framework is employed to assess the possibility of predicting post-operative lens dislocation during cataract surgery. This is achieved by performing a large-scale study on the statistical differences between the behavior of different brands of intraocular lenses and aligning the results with expert surgeons' hypotheses and observations about the lenses. We exploit a large-scale dataset of cataract surgery videos featuring four intraocular lens brands. Experimental results confirm the reliability of the proposed framework in evaluating the lens statistics during surgery. The Pearson correlation and t-test results reveal significant correlations between lens unfolding delay and lens rotation and significant differences between the intra-operative rotation stability of the four groups of lenses. These results suggest that the proposed framework can help surgeons select lenses based on the patient's eye conditions and predict post-surgical lens dislocation.
Submitted 6 December, 2023;
originally announced December 2023.
-
Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers
Authors:
Sahar Nasirihaghighi,
Negin Ghamsarian,
Heinrich Husslein,
Klaus Schoeffmann
Abstract:
Analyzing laparoscopic surgery videos presents a complex and multifaceted challenge, with applications including surgical training, intra-operative surgical complication prediction, and post-operative surgical assessment. Identifying crucial events within these videos is a significant prerequisite in a majority of these applications. In this paper, we introduce a comprehensive dataset tailored for relevant event recognition in laparoscopic gynecology videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. To validate the precision of our annotations, we assess event recognition performance using several CNN-RNN architectures. Furthermore, we introduce and evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos. Leveraging the Transformer networks, our proposed architecture harnesses inter-frame dependencies to counteract the adverse effects of relevant content occlusion, motion blur, and surgical scene variation, thus significantly enhancing event recognition accuracy. Moreover, we present a frame sampling strategy designed to manage variations in surgical scenes and the surgeons' skill level, resulting in event recognition with high temporal resolution. We empirically demonstrate the superiority of our proposed methodology in event recognition compared to conventional CNN-RNN architectures through a series of extensive experiments.
Submitted 1 December, 2023;
originally announced December 2023.
-
Action Recognition in Video Recordings from Gynecologic Laparoscopy
Authors:
Sahar Nasirihaghighi,
Negin Ghamsarian,
Daniela Stefanics,
Klaus Schoeffmann,
Heinrich Husslein
Abstract:
Action recognition is a prerequisite for many applications in laparoscopic video analysis including but not limited to surgical training, operation room planning, follow-up surgery preparation, post-operative surgical assessment, and surgical outcome estimation. However, automatic action recognition in laparoscopic surgeries involves numerous challenges such as (I) cross-action and intra-action duration variation, (II) relevant content distortion due to smoke, blood accumulation, fast camera motions, organ movements, object occlusion, and (III) surgical scene variations due to different illuminations and viewpoints. Besides, action annotations in laparoscopy surgeries are limited and expensive due to requiring expert knowledge. In this study, we design and evaluate a CNN-RNN architecture as well as a customized training-inference framework to deal with the mentioned challenges in laparoscopic surgery action recognition. Using stacked recurrent layers, our proposed network takes advantage of inter-frame dependencies to negate the negative effect of content distortion and variation in action recognition. Furthermore, our proposed frame sampling strategy effectively manages the duration variations in surgical actions to enable action recognition with high temporal resolution. Our extensive experiments confirm the superiority of our proposed method in action recognition compared to static CNNs.
Submitted 30 November, 2023;
originally announced November 2023.
-
Content-Adaptive Variable Framerate Encoding Scheme for Green Live Streaming
Authors:
Vignesh V Menon,
Samira Afzal,
Prajit T Rajendran,
Klaus Schoeffmann,
Radu Prodan,
Christian Timmerer
Abstract:
Adaptive live video streaming applications use a fixed predefined configuration for the bitrate ladder with constant framerate and encoding presets in a session. However, selecting optimized framerates and presets for every bitrate ladder representation can enhance perceptual quality, improve computational resource allocation, and thus the streaming energy efficiency. In particular, low framerates for low-bitrate representations reduce compression artifacts and decrease encoding energy consumption. In addition, an optimized preset may lead to improved compression efficiency. To this end, this paper proposes a Content-adaptive Variable Framerate (CVFR) encoding scheme, which offers two modes of operation: ecological (ECO) and high-quality (HQ). CVFR-ECO optimizes for the highest encoding energy savings by predicting the optimized framerate for each representation in the bitrate ladder. CVFR-HQ takes it further by predicting each representation's optimized framerate-encoding preset pair using low-complexity discrete cosine transform energy-based spatial and temporal features for compression efficiency and sustainable storage. We demonstrate the advantage of CVFR using the x264 open-source video encoder. The results show that CVFR-ECO yields an average PSNR and VMAF increase of 0.02 dB and 2.50 points, respectively, for the same bitrate, compared to the fastest preset highest framerate encoding. CVFR-ECO also yields an average encoding and storage energy consumption reduction of 34.54% and 76.24%, respectively, considering a just noticeable difference (JND) of six VMAF points. In comparison, CVFR-HQ yields an average increase in PSNR and VMAF of 2.43 dB and 10.14 points, respectively, for the same bitrate. Finally, CVFR-HQ results in an average reduction in storage energy consumption of 83.18%, considering a JND of six VMAF points.
Submitted 14 November, 2023;
originally announced November 2023.
-
Energy-Efficient Multi-Codec Bitrate-Ladder Estimation for Adaptive Video Streaming
Authors:
Vignesh V Menon,
Reza Farahani,
Prajit T Rajendran,
Samira Afzal,
Klaus Schoeffmann,
Christian Timmerer
Abstract:
With the emergence of multiple modern video codecs, streaming service providers are forced to encode, store, and transmit bitrate ladders of multiple codecs separately, consequently suffering from additional energy costs for encoding, storage, and transmission. To tackle this issue, we introduce an online energy-efficient Multi-Codec Bitrate ladder Estimation scheme (MCBE) for adaptive video streaming applications. In MCBE, quality representations within the bitrate ladder of new-generation codecs (e.g., High Efficiency Video Coding (HEVC), Alliance for Open Media Video 1 (AV1)) that lie below the predicted rate-distortion curve of the Advanced Video Coding (AVC) codec are removed. Moreover, perceptual redundancy between representations of the bitrate ladders of the considered codecs is also minimized based on a Just Noticeable Difference (JND) threshold. To this end, random forest-based models predict the VMAF score of bitrate ladder representations of each codec. In a live streaming session where all clients support the decoding of AVC, HEVC, and AV1, MCBE reduces cumulative encoding energy by 56.45%, storage energy usage by 94.99%, and transmission energy usage by 77.61% (considering a JND of six VMAF points). These energy reductions are in comparison to a baseline bitrate ladder encoding based on current industry practice.
Submitted 14 October, 2023;
originally announced October 2023.
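The JND-based removal of perceptually redundant representations mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_ladder`, the greedy keep-if-one-JND-apart rule, and the sample bitrate/VMAF pairs are all assumptions made for illustration; only the threshold of six VMAF points comes from the abstract.

```python
from typing import List, Tuple

JND_VMAF = 6.0  # JND threshold in VMAF points, as quoted in the abstract

def prune_ladder(ladder: List[Tuple[int, float]],
                 jnd: float = JND_VMAF) -> List[Tuple[int, float]]:
    """Greedy pruning (an assumed rule): walk the ladder by ascending bitrate
    and keep a representation only if its predicted VMAF exceeds that of the
    last kept representation by at least one JND."""
    kept: List[Tuple[int, float]] = []
    for bitrate, vmaf in sorted(ladder):
        if not kept or vmaf - kept[-1][1] >= jnd:
            kept.append((bitrate, vmaf))
    return kept

# Hypothetical (bitrate in kbps, predicted VMAF) pairs, invented for this example.
ladder = [(145, 40.0), (300, 44.0), (600, 52.0), (900, 55.0),
          (1600, 63.0), (2400, 66.0), (3400, 70.0)]
pruned = prune_ladder(ladder)
print(pruned)
```

In this sketch, representations whose predicted quality is within one JND of an already selected representation are dropped, which is one simple way to reduce the number of encodes that must be produced and stored.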
-
Domain Adaptation for Medical Image Segmentation using Transformation-Invariant Self-Training
Authors:
Negin Ghamsarian,
Javier Gamazo Tejero,
Pablo Márquez Neila,
Sebastian Wolf,
Martin Zinkernagel,
Klaus Schoeffmann,
Raphael Sznitman
Abstract:
Models capable of leveraging unlabeled data are crucial in overcoming large distribution gaps between the acquired datasets across different imaging devices and configurations. In this regard, self-training techniques based on pseudo-labeling have been shown to be highly effective for semi-supervised domain adaptation. However, the unreliability of pseudo-labels can hinder the capability of self-training techniques to induce abstract representations from the unlabeled target dataset, especially in the case of large distribution gaps. Since the neural network performance should be invariant to image transformations, we exploit this property to identify uncertain pseudo-labels. Indeed, we argue that transformation-invariant detections can provide more reasonable approximations of ground truth. Accordingly, we propose a semi-supervised learning strategy for domain adaptation termed transformation-invariant self-training (TI-ST). The proposed method assesses pixel-wise pseudo-labels' reliability and filters out unreliable detections during self-training. We perform comprehensive evaluations for domain adaptation using three different modalities of medical images, two different network architectures, and several alternative state-of-the-art domain adaptation methods. Experimental results confirm the superiority of our proposed method in mitigating the lack of target domain annotation and boosting segmentation performance in the target domain.
Submitted 31 July, 2023;
originally announced July 2023.
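The idea of filtering pixel-wise pseudo-labels by transformation invariance, as described in the abstract above, can be sketched as follows. The abstract does not specify the transformation or the agreement criterion, so the horizontal flip, the exact-match rule, the `IGNORE` marker, and the function names are assumptions for illustration and should not be read as the TI-ST method itself.

```python
from typing import List

Mask = List[List[int]]
IGNORE = -1  # hypothetical marker for pixels excluded from the self-training loss

def hflip(mask: Mask) -> Mask:
    """Horizontally flip a 2D label mask."""
    return [row[::-1] for row in mask]

def filter_pseudo_labels(pred_original: Mask, pred_on_flipped: Mask) -> Mask:
    """Keep a pseudo-label only where the prediction on the original image
    agrees with the re-aligned prediction on the flipped image."""
    realigned = hflip(pred_on_flipped)  # undo the flip so pixels correspond
    return [
        [a if a == b else IGNORE for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred_original, realigned)
    ]

# Toy predictions standing in for network outputs (invented for this example).
pred_original = [[0, 1, 1],
                 [0, 0, 1]]
pred_on_flipped = [[1, 1, 0],
                   [1, 1, 0]]
reliable = filter_pseudo_labels(pred_original, pred_on_flipped)
print(reliable)
```

Pixels marked `IGNORE` would simply be left out of the loss on the unlabeled target data in this sketch, so only transformation-consistent predictions act as supervision.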
-
Relevance-Based Compression of Cataract Surgery Videos
Authors:
Natalia Mathá,
Klaus Schoeffmann,
Konstantin Schekotihin,
Stephanie Sarny,
Doris Putzgruber-Adamitsch,
Yosuf El-Shabrawi
Abstract:
In the last decade, the need for storing videos from cataract surgery has increased significantly. Hospitals continue to improve their imaging and recording devices (e.g., microscopes and cameras used in microscopic surgery, such as ophthalmology) to enhance their post-surgical processing efficiency. The video recordings enable many use cases after the actual surgery, for example, teaching, documentation, and forensics. However, videos recorded from operations are typically stored in the internal archive without any domain-specific compression, leading to massive storage space consumption. In this work, we propose a relevance-based compression scheme for videos from cataract surgery, which is based on content specifics of particular cataract surgery phases. We evaluate our compression scheme with three state-of-the-art video codecs, namely H.264/AVC, H.265/HEVC, and AV1, and ask medical experts to evaluate the visual quality of encoded videos. Our results show significant savings, in particular up to 95.94% when using H.264/AVC, up to 98.71% when using H.265/HEVC, and up to 98.82% when using AV1.
Submitted 22 June, 2023;
originally announced June 2023.
-
Green Video Complexity Analysis for Efficient Encoding in Adaptive Video Streaming
Authors:
Vignesh V Menon,
Christian Feldmann,
Klaus Schoeffmann,
Mohammad Ghanbari,
Christian Timmerer
Abstract:
For adaptive streaming applications, low-complexity and accurate video complexity features are necessary to analyze the video content in real time, which ensures fast and compression-efficient video streaming without disruptions. State-of-the-art video complexity features are Spatial Information (SI) and Temporal Information (TI) features, which do not correlate well with the encoding parameters in adaptive streaming applications. To this end, Video Complexity Analyzer (VCA) was introduced, determining the features based on Discrete Cosine Transform (DCT)-energy. This paper presents optimizations on VCA for faster and energy-efficient video complexity analysis. Experimental results show that VCA v2.0, using eight CPU threads, Single Instruction Multiple Data (SIMD), and low-pass DCT optimization, determines seven complexity features of Ultra High Definition 8-bit videos with better accuracy at a speed of up to 292.68 fps and an energy consumption 97.06% lower than the reference SITI implementation.
Submitted 24 April, 2023;
originally announced April 2023.
-
Video Quality Assessment with Texture Information Fusion for Streaming Applications
Authors:
Vignesh V Menon,
Prajit T Rajendran,
Reza Farahani,
Klaus Schoeffmann,
Christian Timmerer
Abstract:
The rise in video streaming applications has increased the demand for video quality assessment (VQA). In 2016, Netflix introduced Video Multi-Method Assessment Fusion (VMAF), a full-reference VQA metric that strongly correlates with perceptual quality, but its computation is time-intensive. We propose a Discrete Cosine Transform (DCT)-energy-based VQA with texture information fusion (VQ-TIF) model for video streaming applications that determines the visual quality of the reconstructed video compared to the original video. VQ-TIF extracts Structural Similarity (SSIM) and spatiotemporal features of the frames from the original and reconstructed videos and fuses them using a long short-term memory (LSTM)-based model to estimate the visual quality. Experimental results show that VQ-TIF estimates the visual quality with a Pearson Correlation Coefficient (PCC) of 0.96 and a Mean Absolute Error (MAE) of 2.71, on average, compared to the ground truth VMAF scores. Additionally, VQ-TIF estimates the visual quality 9.14 times faster than the state-of-the-art VMAF implementation, along with an 89.44% reduction in energy consumption, assuming an Ultra HD (2160p) display resolution.
Submitted 24 January, 2024; v1 submitted 28 February, 2023;
originally announced February 2023.
-
DeepPyramid: Enabling Pyramid View and Deformable Pyramid Reception for Semantic Segmentation in Cataract Surgery Videos
Authors:
Negin Ghamsarian,
Mario Taschwer,
Raphael Sznitman,
Klaus Schoeffmann
Abstract:
Semantic segmentation in cataract surgery has a wide range of applications contributing to surgical outcome enhancement and clinical risk reduction. However, the varying issues in segmenting the different relevant structures in these surgeries make the design of a single network quite challenging. This paper proposes a semantic segmentation network, termed DeepPyramid, that can deal with these challenges using three novelties: (1) a Pyramid View Fusion module which provides a varying-angle global view of the surrounding region centering at each pixel position in the input convolutional feature map; (2) a Deformable Pyramid Reception module which enables a wide deformable receptive field that can adapt to geometric transformations in the object of interest; and (3) a dedicated Pyramid Loss that adaptively supervises multi-scale semantic feature maps. Combined, we show that these modules can effectively boost semantic segmentation performance, especially in the case of transparency, deformability, scalability, and blunt edges in objects. We demonstrate that our approach performs at a state-of-the-art level and outperforms a number of existing methods by a large margin (3.66% overall improvement in intersection over union compared to the best rival approach).
Submitted 4 July, 2022;
originally announced July 2022.
-
ReCal-Net: Joint Region-Channel-Wise Calibrated Network for Semantic Segmentation in Cataract Surgery Videos
Authors:
Negin Ghamsarian,
Mario Taschwer,
Doris Putzgruber-Adamitsch,
Stephanie Sarny,
Yosuf El-Shabrawi,
Klaus Schoeffmann
Abstract:
Semantic segmentation in surgical videos is a prerequisite for a broad range of applications towards improving surgical outcomes and surgical video analysis. However, semantic segmentation in surgical videos involves many challenges. In particular, in cataract surgery, various features of the relevant objects such as blunt edges, color and context variation, reflection, transparency, and motion blur pose a challenge for semantic segmentation. In this paper, we propose a novel convolutional module termed the ReCal module, which can calibrate the feature maps by employing region intra-and-inter-dependencies and channel-region cross-dependencies. This calibration strategy can effectively enhance semantic representation by correlating different representations of the same semantic label, considering a multi-angle local view centering around each pixel. Thus, the proposed module can deal with distant visual characteristics of unique objects as well as cross-similarities in the visual characteristics of different objects. Moreover, we propose a novel network architecture based on the proposed module, termed ReCal-Net. Experimental results confirm the superiority of ReCal-Net compared to rival state-of-the-art approaches for all relevant objects in cataract surgery. Moreover, ablation studies reveal the effectiveness of the ReCal module in boosting semantic segmentation accuracy.
Submitted 25 September, 2021;
originally announced September 2021.
-
DeepPyram: Enabling Pyramid View and Deformable Pyramid Reception for Semantic Segmentation in Cataract Surgery Videos
Authors:
Negin Ghamsarian,
Mario Taschwer,
Klaus Schoeffmann
Abstract:
Semantic segmentation in cataract surgery has a wide range of applications contributing to surgical outcome enhancement and clinical risk reduction. However, the varying issues in segmenting the different relevant instances make the design of a single network quite challenging. This paper proposes a semantic segmentation network termed DeepPyram that can achieve superior performance in segmenting relevant objects in cataract surgery videos with varying issues. This superiority mainly originates from three modules: (i) Pyramid View Fusion, which provides a varying-angle global view of the surrounding region centering at each pixel position in the input convolutional feature map; (ii) Deformable Pyramid Reception, which enables a wide deformable receptive field that can adapt to geometric transformations in the object of interest; and (iii) Pyramid Loss, which adaptively supervises multi-scale semantic feature maps. These modules can effectively boost semantic segmentation performance, especially in the case of transparency, deformability, scalability, and blunt edges in objects. The proposed approach is evaluated using four datasets of cataract surgery for objects with different contextual features and compared with thirteen state-of-the-art segmentation networks. The experimental results confirm that DeepPyram outperforms the rival approaches without imposing additional trainable parameters. Our comprehensive ablation study further proves the effectiveness of the proposed modules.
Submitted 11 September, 2021;
originally announced September 2021.
-
LensID: A CNN-RNN-Based Framework Towards Lens Irregularity Detection in Cataract Surgery Videos
Authors:
Negin Ghamsarian,
Mario Taschwer,
Doris Putzgruber-Adamitsch,
Stephanie Sarny,
Yosuf El-Shabrawi,
Klaus Schoeffmann
Abstract:
A critical complication after cataract surgery is the dislocation of the lens implant, leading to vision deterioration and eye trauma. In order to reduce the risk of this complication, it is vital to discover the risk factors during the surgery. However, studying the relationship between lens dislocation and its suspected risk factors using numerous videos is a time-consuming procedure. Hence, surgeons demand an automatic approach to enable a larger-scale and, accordingly, more reliable study. In this paper, we propose a novel framework as the major step towards lens irregularity detection. In particular, we propose (I) an end-to-end recurrent neural network to recognize the lens-implantation phase and (II) a novel semantic segmentation network to segment the lens and pupil after the implantation phase. The phase recognition results reveal the effectiveness of the proposed surgical phase recognition approach. Moreover, the segmentation results confirm the proposed segmentation network's effectiveness compared to state-of-the-art rival approaches.
Submitted 2 July, 2021;
originally announced July 2021.
-
Insights on the V3C2 Dataset
Authors:
Luca Rossetto,
Klaus Schoeffmann,
Abraham Bernstein
Abstract:
For research results to be comparable, it is important to have common datasets for experimentation and evaluation. The size of such datasets, however, can be an obstacle to their use. The Vimeo Creative Commons Collection (V3C) is a video dataset designed to be representative of video content found on the web, containing roughly 3800 hours of video in total, split into three shards. In this paper, we present insights on the second of these shards (V3C2) and discuss their implications for research areas, such as video retrieval, for which the dataset might be particularly useful. We also provide all the extracted data in order to simplify the use of the dataset.
Submitted 4 May, 2021;
originally announced May 2021.
-
Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization
Authors:
Negin Ghamsarian,
Mario Taschwer,
Doris Putzgruber-Adamitsch,
Stephanie Sarny,
Klaus Schoeffmann
Abstract:
In cataract surgery, the operation is performed with the help of a microscope. Since the microscope allows only up to two people to watch the surgery in real time, a major part of surgical training is conducted using the recorded videos. To optimize the training procedure with the video content, surgeons require an automatic relevance detection approach. In addition to relevance-based retrieval, these results can be further used for skill assessment and irregularity detection in cataract surgery videos. In this paper, a three-module framework is proposed to detect and classify the relevant phase segments in cataract videos. Taking advantage of an idle frame recognition network, the video is divided into idle and action segments. To boost the performance in relevance detection, the cornea, where the relevant surgical actions are conducted, is detected in all frames using Mask R-CNN. The spatiotemporally localized segments, containing higher-resolution information about the pupil texture and actions and complementary temporal information from the same phase, are fed into the relevance detection module. This module consists of four parallel recurrent CNNs responsible for detecting four relevant phases that have been defined with medical experts. The results are then integrated to classify the action phases as irrelevant or as one of four relevant phases. Experimental results reveal that the proposed approach outperforms static CNNs and different configurations of feature-based and end-to-end recurrent networks.
Submitted 29 April, 2021;
originally announced April 2021.
-
Robust Medical Instrument Segmentation Challenge 2019
Authors:
Tobias Ross,
Annika Reinke,
Peter M. Full,
Martin Wagner,
Hannes Kenngott,
Martin Apitz,
Hellena Hempe,
Diana Mindroc Filimon,
Patrick Scholz,
Thuy Nuong Tran,
Pierangela Bruno,
Pablo Arbeláez,
Gui-Bin Bian,
Sebastian Bodenstedt,
Jon Lindström Bolmgren,
Laura Bravo-Sánchez,
Hua-Bin Chen,
Cristina González,
Dong Guo,
Pål Halvorsen,
Pheng-Ann Heng,
Enes Hosgor,
Zeng-Guang Hou,
Fabian Isensee,
Debesh Jha,
et al. (25 additional authors not shown)
Abstract:
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions. While numerous methods for detecting, segmenting and tracking of medical instruments based on endoscopic video images have been proposed in the literature, key limitations remain to be addressed: Firstly, robustness, that is, the reliable performance of state-of-the-art methods when run on challenging images (e.g. in the presence of blood, smoke or motion artifacts). Secondly, generalization; algorithms trained for a specific intervention in a specific hospital should generalize to other interventions or institutions.
In an effort to promote solutions for these limitations, we organized the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge as an international benchmarking competition with a specific focus on the robustness and generalization capabilities of algorithms. For the first time in the field of endoscopic image processing, our challenge included a task on binary segmentation and also addressed multi-instance detection and segmentation. The challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures from three different types of surgery. The validation of the competing methods for the three tasks (binary segmentation, multi-instance detection and multi-instance segmentation) was performed in three different stages with an increasing domain gap between the training and the test data. The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap. While the average detection and segmentation quality of the best-performing algorithms is high, future research should concentrate on detection and segmentation of small, crossing, moving and transparent instrument(s) (parts).
Submitted 19 May, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
The diveXplore System at the Video Browser Showdown 2018 - Final Notes
Authors:
Klaus Schoeffmann,
Bernd Münzer,
Jürgen Primus,
Andreas Leibetseder
Abstract:
This short paper provides further details of the diveXplore system (formerly known as CoViSS), which has been used by team ITEC1 for the Video Browser Showdown (VBS) 2018. In particular, it gives a short overview of search features and some details of final system changes, not included in the corresponding VBS2018 paper, as well as a basic analysis of how the system has been used for VBS2018 (from a user perspective).
Submitted 5 April, 2018;
originally announced April 2018.