Fine-Grained Classroom Activity Detection from Audio with Neural Networks
Authors:
Eric Slyman,
Chris Daw,
Morgan Skrabut,
Ana Usenko,
Brian Hutchinson
Abstract:
Instructors are increasingly incorporating student-centered learning techniques in their classrooms to improve learning outcomes. In addition to lecture, these class sessions involve forms of individual and group work, and greater rates of student-instructor interaction. Quantifying classroom activity is a key element of accelerating the evaluation and refinement of innovative teaching practices, but manual annotation does not scale. In this manuscript, we present advances to the young application area of automatic classroom activity detection from audio. Using a university classroom corpus with nine activity labels (e.g., "lecture," "group work," "student question"), we propose and evaluate deep fully connected, convolutional, and recurrent neural network architectures, comparing the performance of mel-filterbank, OpenSmile, and self-supervised acoustic features. We compare 9-way classification performance with 5-way and 4-way simplifications of the task and assess two types of generalization: (1) new class sessions from previously seen instructors, and (2) previously unseen instructors. We obtain strong results on the new fine-grained task and state-of-the-art on the 4-way task: our best model obtains frame-level error rates of 6.2%, 7.7% and 28.0% when generalizing to unseen instructors for the 4-way, 5-way, and 9-way classification tasks, respectively (relative reductions of 35.4%, 48.3% and 21.6% over a strong baseline). When estimating the aggregate time spent on classroom activities, our average root mean squared error is 1.64 minutes per class session, a 54.9% relative reduction over the baseline.
Submitted 9 November, 2021; v1 submitted 29 July, 2021;
originally announced July 2021.
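The abstract above reports three kinds of numbers: frame-level error rates, relative reductions over a baseline, and a root mean squared error on aggregate time per activity. The sketch below shows one plausible way to compute each. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the toy label sequences, and the exact RMSE averaging are all hypothetical, and the paper may define these quantities differently.

```python
from math import sqrt


def frame_error_rate(reference, predicted):
    """Fraction of frames whose predicted label differs from the reference."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    errors = sum(r != p for r, p in zip(reference, predicted))
    return errors / len(reference)


def relative_reduction(baseline, system):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - system) / baseline


def rmse(true_values, estimated_values):
    """Root mean squared error between paired values (e.g. minutes per activity)."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimated_values)]
    return sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): ten frames, one of which is mislabeled.
reference = ["lecture"] * 6 + ["group work"] * 4
predicted = ["lecture"] * 5 + ["group work"] * 5

fer = frame_error_rate(reference, predicted)   # 1 wrong frame out of 10
reduction = relative_reduction(0.2, fer)       # against an assumed 20% baseline
time_rmse = rmse([30.0, 10.0], [28.0, 12.0])   # assumed minutes per activity
```

A relative reduction is taken with respect to the baseline value, so halving an error rate from 20% to 10% is a 50% relative reduction even though the absolute difference is 10 percentage points.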
Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers
Authors:
Piper Wolters,
Logan Sizemore,
Chris Daw,
Brian Hutchinson,
Lauren Phillips
Abstract:
Many applications involve detecting and localizing specific sound events within long, untrimmed documents, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state-of-the-art for these tasks. However, for some types of events, there is insufficient labeled data to train such models. In this paper, we propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture. Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets: "Vox-CASE," using clips of celebrity speech as the sound event, and "ESC-CASE," using environmental sound events. Our highest performing proposed few-shot approaches achieve 0.483 and 0.418 F1-score, respectively, with 5-shot 5-way tasks on these two datasets. These represent relative improvements of 72.5% and 11.2% over strong proposal-free few-shot sound event detection baselines.
Submitted 23 December, 2023; v1 submitted 28 July, 2021;
originally announced July 2021.