Showing 1–50 of 74 results for author: Petridis, S

  1. arXiv:2505.18972  [pdf, ps, other]

    eess.AS cs.AI

    Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis

    Authors: Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

    Abstract: This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from face image, and the characteristics of output speech (e.g., pace, noise level, distance, tone, place) can be controllable with natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: Interspeech 2025

  2. arXiv:2505.15313  [pdf, ps, other]

    cs.CV

    FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

    Authors: Kazuaki Mishima, Antoni Bigata Casademunt, Stavros Petridis, Maja Pantic, Kenji Suzuki

    Abstract: Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is pa…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 9 pages (excluding references), 3 figures, 5 tables

  3. arXiv:2505.14336  [pdf, ps, other]

    eess.AS cs.CV cs.MM cs.SD

    Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach

    Authors: Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti

    Abstract: Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent advances integrate Large Language Models (LLMs) into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale m…

    Submitted 21 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: Interspeech 2025

  4. arXiv:2505.00497  [pdf, other]

    cs.CV

    KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

    Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and…

    Submitted 1 May, 2025; originally announced May 2025.

  5. arXiv:2503.08798  [pdf, other]

    cs.SD cs.LG eess.AS

    Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction

    Authors: Minsu Kim, Rodrigo Mira, Honglie Chen, Stavros Petridis, Maja Pantic

    Abstract: In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stre…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted to ICASSP 2025

  6. arXiv:2503.06362  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

    Authors: Umberto Cappellazzo, Minsu Kim, Stavros Petridis

    Abstract: Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substant…

    Submitted 8 March, 2025; originally announced March 2025.

  7. arXiv:2503.06273  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

    Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro

    Abstract: We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the st…

    Submitted 8 March, 2025; originally announced March 2025.

  8. arXiv:2503.01715  [pdf, other]

    cs.CV cs.AI

    KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

    Authors: Antoni Bigata, Michał Stypułkowski, Rodrigo Mira, Stella Bounareli, Konstantinos Vougioukas, Zoe Landgraf, Nikita Drobyshev, Maciej Zieba, Stavros Petridis, Maja Pantic

    Abstract: Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to…

    Submitted 19 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  9. Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning

    Authors: Michael Xieyang Liu, Savvas Petridis, Vivian Tsai, Alexander J. Fiannaca, Alex Olwal, Michael Terry, Carrie J. Cai

    Abstract: Multimodal large language models (MLLMs), with their expansive world knowledge and reasoning capabilities, present a unique opportunity for end-users to create personalized AI sensors capable of reasoning about complex situations. A user could describe a desired sensing task in natural language (e.g., "alert if my toddler is getting into mischief"), with the MLLM analyzing the camera feed and resp…

    Submitted 26 January, 2025; originally announced January 2025.

    Journal ref: 30th International Conference on Intelligent User Interfaces (IUI'25), March 24-27, 2025, Cagliari, Italy. ACM, New York, NY, USA, 16 pages

  10. arXiv:2411.02256  [pdf, other]

    cs.CV

    Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

    Authors: Alexandros Haliassos, Rodrigo Mira, Honglie Chen, Zoe Landgraf, Stavros Petridis, Maja Pantic

    Abstract: Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strate…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: NeurIPS 2024. Code: https://github.com/ahaliassos/usr

  11. arXiv:2410.07771  [pdf, other]

    cs.SD cs.AI cs.CL cs.CV eess.AS

    Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models

    Authors: Adriana Fernandez-Lopez, Shiwei Liu, Lu Yin, Stavros Petridis, Maja Pantic

    Abstract: This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding several notable findings. Firstly, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even w…

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: Submitted to ICASSP 2025

  12. arXiv:2409.12319  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Large Language Models are Strong Audio-Visual Speech Recognition Learners

    Authors: Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic

    Abstract: Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results.…

    Submitted 7 March, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: Accepted for publication at ICASSP 2025. The code and checkpoints are available here: https://github.com/umbertocappellazzo/Llama-AVSR

  13. arXiv:2407.07825  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

    Authors: Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic

    Abstract: In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40ms input frame. We do so by devising new visua…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Interspeech 2024

  14. arXiv:2406.18373  [pdf, other]

    cs.CL cs.SD eess.AS

    Dynamic Data Pruning for Automatic Speech Recognition

    Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

    Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  15. arXiv:2406.17614  [pdf, other]

    cs.CV cs.MM

    MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

    Authors: Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic

    Abstract: Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as MSRS (Multimodal Speech Recognition from Scratch), introduces a sparse regulari…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  16. arXiv:2406.09264  [pdf, other]

    cs.HC cs.AI cs.CL

    Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

    Authors: Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, David Jurgens

    Abstract: Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve th…

    Submitted 10 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: proposing "bidirectional human-AI alignment" framework after a systematic review of over 400 alignment papers

  17. arXiv:2405.03806  [pdf, other]

    cs.HC

    In Situ AI Prototyping: Infusing Multimodal Prompts into Mobile Settings with MobileMaker

    Authors: Savvas Petridis, Michael Xieyang Liu, Alexander J. Fiannaca, Vivian Tsai, Michael Terry, Carrie J. Cai

    Abstract: Recent advances in multimodal large language models (LLMs) have made it easier to rapidly prototype AI-powered features, especially for mobile use cases. However, gathering early, mobile-situated user feedback on these AI prototypes remains challenging. The broad scope and flexibility of LLMs means that, for a given use-case-specific prototype, there is a crucial need to understand the wide range…

    Submitted 1 October, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

  18. arXiv:2404.19110  [pdf, other]

    cs.CV

    EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

    Authors: Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, Maja Pantic

    Abstract: Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis where the driver differs from the animated character, a challenging but highly practical approach. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. We conduct a deep examination and evaluation of this model, with a particular focus on its laten…

    Submitted 29 April, 2024; originally announced April 2024.

  19. arXiv:2404.02098  [pdf, other]

    cs.CV

    BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

    Authors: Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic

    Abstract: Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various setting…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: ICASSP 2024. Code: https://github.com/ahaliassos/raven

  20. arXiv:2403.04894  [pdf, other]

    cs.CL cs.AI

    ConstitutionalExperts: Training a Mixture of Principle-based Prompts

    Authors: Savvas Petridis, Ben Wedin, Ann Yuan, James Wexler, Nithum Thain

    Abstract: Large language models (LLMs) are highly capable at a variety of tasks given the right prompt, but writing one is still a difficult and tedious process. In this work, we introduce ConstitutionalExperts, a method for learning a prompt consisting of constitutional principles (i.e. rules), given a training dataset. Unlike prior methods that optimize the prompt as a single entity, our method incrementa…

    Submitted 7 March, 2024; originally announced March 2024.

  21. arXiv:2401.08972  [pdf, other]

    cs.CV

    Hearing Loss Detection from Facial Expressions in One-on-one Conversations

    Authors: Yufeng Yin, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Stavros Petridis, Yu-Hsiang Wu, Christi Miller

    Abstract: Individuals with impaired hearing experience difficulty in conversations, especially in noisy environments. This difficulty often manifests as a change in behavior and may be captured via facial expressions, such as the expression of discomfort or fatigue. In this work, we build on this idea and introduce the problem of detecting hearing loss from an individual's facial expressions during a conver…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  22. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  23. arXiv:2310.15435  [pdf, other]

    cs.HC cs.AI

    PromptInfuser: How Tightly Coupling AI and UI Design Impacts Designers' Workflows

    Authors: Savvas Petridis, Michael Terry, Carrie J. Cai

    Abstract: Prototyping AI applications is notoriously difficult. While large language model (LLM) prompting has dramatically lowered the barriers to AI prototyping, designers are still prototyping AI functionality and UI separately. We investigate how coupling prompt and UI design affects designers' workflows. Grounding this research, we developed PromptInfuser, a Figma plugin that enables users to create se…

    Submitted 23 October, 2023; originally announced October 2023.

  24. arXiv:2310.15428  [pdf, other]

    cs.HC cs.AI

    ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles

    Authors: Savvas Petridis, Ben Wedin, James Wexler, Aaron Donsbach, Mahima Pushkarna, Nitesh Goyal, Carrie J. Cai, Michael Terry

    Abstract: Large language model (LLM) prompting is a promising new approach for users to create and customize their own chatbots. However, current methods for steering a chatbot's outputs, such as prompt engineering and fine-tuning, do not support users in converting their natural feedback on the model's outputs to changes in the prompt or model. In this work, we explore how to enable users to interactively…

    Submitted 23 October, 2023; originally announced October 2023.

  25. arXiv:2307.04552  [pdf, other]

    cs.CV

    SparseVSR: Lightweight and Noise Robust Visual Speech Recognition

    Authors: Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, Maja Pantic

    Abstract: Recent advances in deep neural networks have achieved unprecedented success in visual speech recognition. However, there remains substantial disparity between current methods and their deployment in resource-constrained devices. In this work, we explore different magnitude-based pruning techniques to generate a lightweight model that achieves higher performance than its dense model equivalent, esp…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to Interspeech 2023

  26. arXiv:2305.11364  [pdf, other]

    cs.CL cs.AI

    Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

    Authors: Emily Reif, Minsuk Kahng, Savvas Petridis

    Abstract: Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactic…

    Submitted 27 September, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  27. arXiv:2305.08854  [pdf, other]

    cs.CV cs.AI cs.LG

    Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models

    Authors: Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, the field remains underexplored regarding non-verbal communication despite evidence demonstrating its importance in human interaction. In particular, generating laughter sequences presents a unique challenge due to the intricacy and nuances of this behaviour…

    Submitted 30 August, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

  28. arXiv:2305.03711  [pdf, other]

    cs.LG cs.CY

    Medical records condensation: a roadmap towards healthcare data democratisation

    Authors: Yujiang Wang, Anshul Thakur, Mingzhi Dong, Pingchuan Ma, Stavros Petridis, Li Shang, Tingting Zhu, David A. Clifton

    Abstract: The prevalence of artificial intelligence (AI) has envisioned an era of healthcare democratisation that promises every stakeholder a new and better way of life. However, the advancement of clinical AI research is significantly hurdled by the dearth of data democratisation in healthcare. To truly democratise data for AI studies, challenges are two-fold: 1. the sensitive information in clinical data…

    Submitted 8 January, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

  29. arXiv:2303.17200  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

    Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

    Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems wit…

    Submitted 3 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: IEEE/CVF CVPR 2023

  30. Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

    Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

    Abstract: Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence…

    Submitted 28 June, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  31. arXiv:2303.09455  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    Learning Cross-lingual Visual Speech Representations

    Authors: Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised visual representation learning. We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled…

    Submitted 14 March, 2023; originally announced March 2023.

  32. arXiv:2301.03396  [pdf, other]

    cs.CV

    Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

    Authors: Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic

    Abstract: Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autore…

    Submitted 29 July, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

  33. arXiv:2212.06246  [pdf, other]

    cs.LG cs.CV cs.SD

    Jointly Learning Visual and Auditory Speech Representations from Raw Data

    Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic

    Abstract: We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Wher…

    Submitted 4 April, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: ICLR 2023. Code: https://github.com/ahaliassos/raven

  34. arXiv:2211.10999  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

    Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic

    Abstract: Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use…

    Submitted 13 March, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: accepted to ICASSP 2023

  35. arXiv:2211.02133  [pdf, other]

    eess.AS cs.CV cs.SD

    Streaming Audio-Visual Speech Recognition with Alignment Regularization

    Authors: Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

    Abstract: In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by…

    Submitted 1 July, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  36. arXiv:2210.11341  [pdf, other]

    cs.CV cs.LG

    SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video

    Authors: Marija Jegorova, Stavros Petridis, Maja Pantic

    Abstract: This work focuses on the apparent emotional reaction recognition (AERR) from the video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and larger datasets that might be deemed unfit for the targ…

    Submitted 20 October, 2022; originally announced October 2022.

  37. Training Strategies for Improved Lip-reading

    Authors: Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic

    Abstract: Several training strategies and temporal models have been recently proposed for isolated word lip-reading in a series of independent works. However, the potential of combining the best strategies and investigating the impact of each of them has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other…

    Submitted 29 September, 2022; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: ICASSP 2022. Code is available at https://sites.google.com/view/audiovisual-speech-recognition

    Journal ref: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8472-8476, 2022

  38. arXiv:2205.02058  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    SVTS: Scalable Video-to-Speech Synthesis

    Authors: Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contempora…

    Submitted 15 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: accepted to INTERSPEECH 2022 (Oral Presentation)

  39. Self-supervised Video-centralised Transformer for Video Face Clustering

    Authors: Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Luo, Yiming Lin, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: This paper presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representation and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture the complicated video dynamics. In addition, despite the recent progress in video-based cont…

    Submitted 15 February, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

  40. arXiv:2202.13084  [pdf, other]

    cs.CV cs.SD eess.AS

    Visual Speech Recognition for Multiple Languages in the Wild

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. H…

    Submitted 30 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

    Comments: Published in Nature Machine Intelligence

  41. arXiv:2201.07131  [pdf, other]

    cs.CV

    Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection

    Authors: Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic

    Abstract: One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression. In this paper, we examine whether we can tackle this issue by harnessing videos of real talking faces, which contain rich information on natural facial appearance and behaviour and are re…

    Submitted 21 October, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Code: https://github.com/ahaliassos/RealForensics

  42. arXiv:2111.04920  [pdf, other]

    cs.HC

    PopBlends: Strategies for Conceptual Blending with Large Language Models

    Authors: Sitong Wang, Savvas Petridis, Taeahn Kwon, Xiaojuan Ma, Lydia B. Chilton

    Abstract: Pop culture is an important aspect of communication. On social media people often post pop culture reference images that connect an event, product or other entity to a pop culture domain. Creating these images is a creative challenge that requires finding a conceptual connection between the users' topic and a pop culture domain. In cognitive theory, this task is called conceptual blending. We pres…

    Submitted 19 February, 2023; v1 submitted 8 November, 2021; originally announced November 2021.

  43. arXiv:2110.11850  [pdf, other]

    cs.CL

    Lightweight Decoding Strategies for Increasing Specificity

    Authors: Katy Ilonka Gero, Chris Kedzie, Savvas Petridis, Lydia Chilton

    Abstract: Language models are known to produce vague and generic outputs. We propose two unsupervised decoding strategies based on either word-frequency or point-wise mutual information to increase the specificity of any model that outputs a probability distribution over its vocabulary at generation time. We test the strategies in a prompt completion task; with human evaluations, we find that both strategie…

    Submitted 22 October, 2021; originally announced October 2021.

  44. arXiv:2110.09168  [pdf, other]

    cs.CV cs.LG

    Domain Generalisation for Apparent Emotional Facial Expression Recognition across Age-Groups

    Authors: Rafael Poyiadzi, Jie Shen, Stavros Petridis, Yujiang Wang, Maja Pantic

    Abstract: Apparent emotional facial expression recognition has attracted a lot of research attention recently. However, the majority of approaches ignore age differences and train a generic model for all ages. In this work, we study the effect of using different age-groups for training apparent emotional facial expression recognition models. To this end, we study Domain Generalisation in the context of appa…

    Submitted 18 October, 2021; originally announced October 2021.

  45. arXiv:2106.09171  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    LiRA: Learning Visual Speech Representations from Audio through Self-supervision

    Authors: Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted for publication at Interspeech 2021

  46. arXiv:2104.13332  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks

    Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based…

    Submitted 15 August, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

    Comments: Published in IEEE Transactions on Cybernetics (April 2022)

  47. arXiv:2102.09281  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    DINO: A Conditional Energy-Based GAN for Domain Translation

    Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics si…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to ICLR 2021

  48. arXiv:2102.06657  [pdf, other]

    cs.CV eess.AS

    End-to-end Audio-visual Speech Recognition with Conformers

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers and then fusion takes place via a Multi-Layer Perceptron (MLP). T…

    Submitted 12 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021

  49. arXiv:2012.07657  [pdf, other]

    cs.CV

    Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection

    Authors: Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Although current deep learning-based face forgery detectors achieve impressive performance in constrained scenarios, they are vulnerable to samples created by unseen manipulation methods. Some recent works show improvements in generalisation but rely on cues that are easily corrupted by common post-processing operations such as compression. In this paper, we propose LipForensics, a detection appro…

    Submitted 15 August, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted at CVPR2021. Code: https://github.com/ahaliassos/LipForensics

  50. arXiv:2010.03623  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Domain Adversarial Neural Networks for Dysarthric Speech Recognition

    Authors: Dominika Woszczyk, Stavros Petridis, David Millard

    Abstract: Speech recognition systems have improved dramatically over the last few years, however, their performance is significantly degraded for the cases of accented or impaired speech. This work explores domain adversarial neural networks (DANN) for speaker-independent speech recognition on the UAS dataset of dysarthric speech. The classification task on 10 spoken digits is performed using an end-to-end…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: 5 pages, to be published in Interspeech 2020