Showing 1–18 of 18 results for author: Senocak, A

  1. arXiv:2505.05343  [pdf, other]

    cs.CV cs.SD eess.AS

    Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audios into tokens compatibl…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Journal Extension of WACV 2024 paper (arXiv:2311.04066). Code is available at https://github.com/swimmiing/ACL-SSL

  2. arXiv:2503.18880  [pdf, other]

    cs.CV cs.SD eess.AS

    Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

    Authors: Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak

    Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  3. arXiv:2412.06209  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual in…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Under-review

  4. arXiv:2410.18325  [pdf, other]

    cs.CV

    AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

    Authors: Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh

    Abstract: Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promi…

    Submitted 17 March, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  5. arXiv:2407.13676  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as sil…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Journal Extension of ICCV 2023 paper (arXiv:2309.10724). Code is available at https://github.com/kaistmm/SSLalignment

  6. arXiv:2407.08691  [pdf, other]

    cs.SD cs.AI eess.AS

    ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs. However, this leads to performance degradation for ASTs in the inference when input lengths vary from the training. This paper introduces an approach that enables th…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Interspeech 2024. Code is available at https://github.com/JiuFengSC/ElasticAST

  7. arXiv:2406.03344  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

    Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung

    Abstract: Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision task…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/mhamzaerol/Audio-Mamba-AuM

  8. arXiv:2401.08415  [pdf, other]

    cs.SD cs.LG eess.AS

    From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily depends on the input audio spectrogram size. In this work, we aim to optimize AST training by linking to the resolution in the time-axis. We introduce multi-pha…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  9. arXiv:2311.04066  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Can CLIP Help Sound Source Localization?

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment. We extend the application of these models, specifically CLIP, to the domain of sound source localization. Unlike conventional approaches, we employ the pre-trained CLIP model without explicit text input, re…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  10. arXiv:2309.10724  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Sound Source Localization is All about Cross-Modal Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for ge…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  11. arXiv:2307.09286  [pdf, other]

    cs.SD cs.LG eess.AS

    FlexiAST: Flexibility is What AST Needs

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs degrades drastically when evaluated using different patch sizes from that used during training. As a result, AST models are typically re-trained to accommodate change…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: Interspeech 2023

  12. arXiv:2303.17517  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

    Authors: Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

    Abstract: The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective. Bilingual VGS models are generally trained with an equal number of spoken captions from both languages. However, in reality, there can be an imbalance among the languages for the available spoken captions. Our key contribution in this work is to leverage the power of a high…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  13. arXiv:2303.17490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The k…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  14. arXiv:2211.01966  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    MarginNCE: Robust Sound Localization with a Negative Margin

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. Contrastive learning in the context of sound source localization leverages the natural correspondence between audio and visual signals where the audio-visual pairs from the same source are assumed as positive, while randomly selected pairs are negatives. However, this approach brings in noisy corre…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. SOTA performance in Audio-Visual Sound Localization. 5 Pages

  15. arXiv:2202.05961  [pdf, other]

    cs.CV eess.IV

    Audio-Visual Fusion Layers for Event Type Aware Video Recognition

    Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Hyeonggon Ryu, Dingzeyu Li, In So Kweon

    Abstract: The human brain is continuously inundated with multisensory information and its complex interactions coming from the outside world at any given moment. Such information is automatically analyzed by binding or segregating in our brain. While this task might seem effortless for human brains, it is extremely challenging to build a machine that can perform similar tasks since complex interactions ca…

    Submitted 11 February, 2022; originally announced February 2022.

  16. arXiv:2202.03007  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    Learning Sound Localization Better From Semantically Similar Samples

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon

    Abstract: The objective of this work is to localize the sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard po…

    Submitted 7 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. SOTA performance in Audio-Visual Sound Localization. 5 Pages

  17. arXiv:1911.09649  [pdf, other]

    cs.CV

    Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

    Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

    Abstract: Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to a…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: To appear in TPAMI. arXiv admin note: substantial text overlap with arXiv:1803.03849

  18. arXiv:1803.03849  [pdf, other]

    cs.CV cs.AI cs.MM

    Learning to Localize Sound Source in Visual Scenes

    Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

    Abstract: Visual events are usually accompanied by sounds in our daily lives. We pose the question: Can the machine learn the correspondence between visual scene and the sound, and localize the sound source only by observing sound and visual scene pairs like a human? In this paper, we propose a novel unsupervised algorithm to address the problem of localizing the sound source in visual scenes. A two-stream ne…

    Submitted 10 March, 2018; originally announced March 2018.

    Comments: To appear in CVPR 2018. Total 9 pages