-
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Authors:
Yung-Hsuan Lai,
Janek Ebbers,
Yu-Chiang Frank Wang,
François Germain,
Michael Jeffrey Jones,
Moitreya Chatterjee
Abstract:
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach, called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV), to overcome these weaknesses. Additionally, our approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature-mixup-based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
Submitted 14 May, 2025;
originally announced May 2025.
-
Using Temperature Sensitivity to Estimate Shiftable Electricity Demand: Implications for power system investments and climate change
Authors:
Michael J. Roberts,
Sisi Zhang,
Eleanor Yuan,
James Jones,
Matthias Fripp
Abstract:
Growth of intermittent renewable energy and climate change make it increasingly difficult to manage electricity demand variability. Centralized storage can help but is costly. An alternative is to shift demand. Cooling and heating demands are substantial and can be economically shifted using thermal storage. To estimate what thermal storage, employed at scale, might do to reshape electricity loads, we pair fine-scale weather data with hourly electricity use to estimate the share of temperature-sensitive demand across 31 regions that span the continental United States. We then show how much variability can be reduced by shifting temperature-sensitive loads, with and without improved transmission between regions. We find that approximately three quarters of within-day, within-region demand variability can be eliminated by shifting just half of temperature-sensitive demand. The variability-reducing benefits of shifting temperature-sensitive demand complement those gained from improved interregional transmission, and greatly mitigate the challenge of serving higher peaks under climate change.
Submitted 13 June, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Classification of Epithelial Ovarian Carcinoma Whole-Slide Pathology Images Using Deep Transfer Learning
Authors:
Yiping Wang,
David Farnell,
Hossein Farahani,
Mitchell Nursey,
Basile Tessier-Cloutier,
Steven J. M. Jones,
David G. Huntsman,
C. Blake Gilks,
Ali Bashashati
Abstract:
Ovarian cancer is the most lethal cancer of the female reproductive organs. There are 5 major histological subtypes of epithelial ovarian cancer, each with distinct morphological, genetic, and clinical features. Currently, these histotypes are determined by a pathologist's microscopic examination of tumor whole-slide images (WSI). This process has been hampered by poor inter-observer agreement (Cohen's kappa 0.54-0.67). We utilized a two-stage deep transfer learning algorithm based on convolutional neural networks (CNN) and progressive resizing for automatic classification of epithelial ovarian carcinoma WSIs. The proposed algorithm achieved a mean accuracy of 87.54% and Cohen's kappa of 0.8106 in the slide-level classification of 305 WSIs, performing better than a standard CNN and pathologists without gynecology-specific training.
Submitted 28 June, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
Grounding Object Detections With Transcriptions
Authors:
Yasufumi Moriya,
Ramon Sanabria,
Florian Metze,
Gareth J. F. Jones
Abstract:
A vast amount of audio-visual data is available on the Internet thanks to video streaming services, to which users upload their content. However, there are difficulties in exploiting available data for supervised statistical models due to the lack of labels. Unfortunately, generating labels for such an amount of data through human annotation can be expensive, time-consuming and prone to annotation errors. In this paper, we propose a method to automatically extract entity-video frame pairs from a collection of instruction videos by using speech transcriptions and videos. We conduct experiments on image recognition and visual grounding tasks on the automatically constructed entity-video frame dataset of How2. The models will be evaluated on a new manually annotated portion of the How2 dev5 and val sets and on the Flickr30k dataset. This work constitutes a first step towards meta-algorithms capable of automatically constructing task-specific training sets.
Submitted 28 July, 2019; v1 submitted 12 June, 2019;
originally announced June 2019.