Showing 1–14 of 14 results for author: Shlizerman, E

Searching in archive eess.
  1. arXiv:2506.05414  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

    Authors: Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman

    Abstract: 3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of t…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Project website with demo videos: https://zijuncui02.github.io/SAVVY/

  2. arXiv:2504.10746  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Hearing Anywhere in Any Environment

    Authors: Xiulong Liu, Anurag Kumar, Paul Calamia, Sebastia V. Amengual, Calvin Murdock, Ishwarya Ananthabhotla, Philip Robinson, Eli Shlizerman, Vamsi Krishna Ithapu, Ruohan Gao

    Abstract: In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geomet…

    Submitted 4 June, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: CVPR 2025; Project Page: https://dragonliu1995.github.io/hearinganywhereinanyenvironment/

  3. arXiv:2411.10657  [pdf, other]

    eess.SP

    Brain-to-Text Decoding with Context-Aware Neural Representations and Large Language Models

    Authors: Jingyuan Li, Trung Le, Chaofei Fan, Mingfei Chen, Eli Shlizerman

    Abstract: Decoding attempted speech from neural activity offers a promising avenue for restoring communication abilities in individuals with speech impairments. Previous studies have focused on mapping neural activity to text using phonemes as the intermediate target. While successful, decoding neural activity directly to phonemes ignores the context dependent nature of the neural activity-to-phoneme mappin…

    Submitted 15 November, 2024; originally announced November 2024.

  4. arXiv:2411.05679  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Tell What You Hear From What You See -- Video to Audio Generation Through Text

    Authors: Xiulong Liu, Kun Su, Eli Shlizerman

    Abstract: The content of visual and audio scenes is multi-faceted such that a video can be paired with various audio and vice-versa. Thereby, in video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video-to-Audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT,…

    Submitted 4 April, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: NeurIPS 2024. Project page: https://dragonliu1995.github.io/VATT-home

  5. arXiv:2409.19132  [pdf, other]

    cs.MM cs.CV cs.LG cs.SD eess.AS

    From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

    Authors: Kun Su, Xiulong Liu, Eli Shlizerman

    Abstract: Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio-visual modalities primarily focused on either audio-visual representation learning or generative modeling of a modal…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: Accepted by ICML 2024

  6. arXiv:2406.06534  [pdf, other]

    cs.CV eess.IV physics.optics

    Compressed Meta-Optical Encoder for Image Classification

    Authors: Anna Wirth-Singh, Jinlin Xiang, Minho Choi, Johannes E. Fröch, Luocheng Huang, Shane Colburn, Eli Shlizerman, Arka Majumdar

    Abstract: Optical and hybrid convolutional neural networks (CNNs) recently have become of increasing interest to achieve low-latency, low-power image classification and computer vision tasks. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers in a standard CNN comes at a significant reduction in accuracy. In this work, we use knowledge distillation to compress modif…

    Submitted 14 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

  7. arXiv:2303.16897  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

    Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan

    Abstract: Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely availab…

    Submitted 8 July, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023. Project page: https://sukun1045.github.io/video-physics-sound-diffusion/

  8. arXiv:2012.03478  [pdf, other]

    cs.SD cs.CV eess.AS

    Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements

    Authors: Kun Su, Xiulong Liu, Eli Shlizerman

    Abstract: We propose a novel system that takes as an input body movements of a musician playing a musical instrument and generates music in an unsupervised setting. Learning to generate multi-instrumental music from videos without labeling the instruments is a challenging problem. To achieve the transformation, we built a pipeline named 'Multi-instrumentalistNet' (MI Net). At its base, the pipeline learns a…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: Please see associated video at https://www.youtube.com/watch?v=yo5OZKBbBh4

  9. arXiv:2006.14348  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS eess.IV

    Audeo: Audio Generation for a Silent Performance Video

    Authors: Kun Su, Xiulong Liu, Eli Shlizerman

    Abstract: We present a novel system that gets as an input video frames of a musician playing the piano and generates the music for that video. Generation of music from visual cues is a challenging problem and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the associat…

    Submitted 22 June, 2020; originally announced June 2020.

    Comments: Please see associated video at https://www.youtube.com/watch?v=8rS3VgjG7_c

    Journal ref: Advances in neural information processing 2020

  10. arXiv:2006.07352  [pdf, other]

    q-bio.NC cs.AI cs.LG eess.SY

    Deep Reinforcement Learning for Neural Control

    Authors: Jimin Kim, Eli Shlizerman

    Abstract: We present a novel methodology for control of neural circuits based on deep reinforcement learning. Our approach achieves aimed behavior by generating external continuous stimulation of existing neural circuits (neuromodulation control) or modulations of neural circuits architecture (connectome control). Both forms of control are challenging due to nonlinear and recurrent complexity of neural acti…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: Please see the associated Video at: https://youtu.be/ixsUMfb9m_U

  11. arXiv:2006.06911  [pdf, other]

    cs.CV cs.LG eess.IV

    Iterate & Cluster: Iterative Semi-Supervised Action Recognition

    Authors: Jingyuan Li, Eli Shlizerman

    Abstract: We propose a novel system for active semi-supervised feature-based action recognition. Given time sequences of features tracked during movements our system clusters the sequences into actions. Our system is based on encoder-decoder unsupervised methods shown to perform clustering by self-organization of their latent representation through the auto-regression task. These methods were tested on huma…

    Submitted 11 June, 2020; originally announced June 2020.

    Comments: for associated video, see https://www.youtube.com/watch?v=ewuoz2tt73E

  12. arXiv:1911.12409  [pdf, other]

    cs.CV cs.LG eess.IV

    PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition

    Authors: Kun Su, Xiulong Liu, Eli Shlizerman

    Abstract: We propose a novel system for unsupervised skeleton-based action recognition. Given inputs of body keypoints sequences obtained during various movements, our system associates the sequences with actions. Our system is based on an encoder-decoder recurrent neural network, where the encoder learns a separable feature representation within its hidden states formed by training the model to perform pre…

    Submitted 27 November, 2019; originally announced November 2019.

    Comments: See video at: https://www.youtube.com/watch?v=-dcCFUBRmwE

  13. arXiv:1905.12176  [pdf, other]

    cs.LG eess.SP q-bio.NC stat.ML

    Clustering and Recognition of Spatiotemporal Features through Interpretable Embedding of Sequence to Sequence Recurrent Neural Networks

    Authors: Kun Su, Eli Shlizerman

    Abstract: Encoder-decoder recurrent neural network models (RNN Seq2Seq) have achieved great success in ubiquitous areas of computation and applications. It was shown to be successful in modeling data with both temporal and spatial dependencies for translation or prediction tasks. In this study, we propose an embedding approach to visualize and interpret the representation of data by these models. Furthermor…

    Submitted 31 January, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

  14. arXiv:1712.09382  [pdf, other]

    eess.AS cs.CV cs.SD

    Audio to Body Dynamics

    Authors: Eli Shlizerman, Lucio M. Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman

    Abstract: We present a method that gets as input an audio of violin or piano playing, and outputs a video of skeleton predictions which are further used to animate an avatar. The key idea is to create an animation of an avatar that moves their hands similarly to how a pianist or violinist would do, just from audio. Aiming for a fully detailed correct arms and fingers motion is a goal, however, it's not clea…

    Submitted 19 December, 2017; originally announced December 2017.

    Comments: Link with videos https://arviolin.github.io/AudioBodyDynamics/

    Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018