Showing 1–23 of 23 results for author: Roman, R

  1. arXiv:2505.23509  [pdf, other]

    cs.SD cs.LG eess.AS

    Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds

    Authors: Andrew Chang, Yike Li, Iran R. Roman, David Poeppel

    Abstract: Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Interspeech 2025

  2. arXiv:2504.02988  [pdf, ps, other]

    cs.SD eess.AS

    Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

    Authors: Adrian S. Roman, Aiden Chang, Gerardo Meza, Iran R. Roman

    Abstract: We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360 synthetic videos where objects move matching synthetic SELD audio data and its annotat…

    Submitted 3 April, 2025; originally announced April 2025.

  3. Design and Implementation of the Transparent, Interpretable, and Multimodal (TIM) AR Personal Assistant

    Authors: Erin McGowan, Joao Rulff, Sonia Castelo, Guande Wu, Shaoyu Chen, Roque Lopez, Bea Steers, Iran R. Roman, Fabio F. Dias, Jing Qian, Parikshit Solunke, Michael Middleton, Ryan McKendrick, Claudio T. Silva

    Abstract: The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be re…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Copyright 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. Article accepted for publication in IEEE Computer Graphics and Applications. This is the author's version, content may change prior to final publication

  4. arXiv:2501.03720  [pdf, other]

    cs.SD eess.AS

    Guitar-TECHS: An Electric Guitar Dataset Covering Techniques, Musical Excerpts, Chords and Scales Using a Diverse Array of Hardware

    Authors: Hegel Pedroza, Wallace Abreu, Ryan M. Corey, Iran R. Roman

    Abstract: Guitar-related machine listening research involves tasks like timbre transfer, performance generation, and automatic transcription. However, small datasets often limit model robustness due to insufficient acoustic diversity and musical content. To address these issues, we introduce Guitar-TECHS, a comprehensive dataset featuring a variety of guitar techniques, musical excerpts, chords, and scales.…

    Submitted 7 January, 2025; originally announced January 2025.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

  5. arXiv:2501.01757  [pdf, other]

    cs.SD eess.AS

    MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling

    Authors: Simon Rouard, Robin San Roman, Yossi Adi, Axel Roebel

    Abstract: While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with 3 stems (bass, drums and other) that learns the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music…

    Submitted 7 January, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

    Comments: 5 pages, 3 figures, accepted to ICASSP 2025

  6. arXiv:2412.08821  [pdf, other]

    cs.CL

    Large Concept Models: Language Modeling in a Sentence Representation Space

    Authors: LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk

    Abstract: LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper,…

    Submitted 15 December, 2024; v1 submitted 11 December, 2024; originally announced December 2024.

    Comments: 49 pages

  7. arXiv:2411.08234  [pdf, other]

    cs.SD

    Analyzing Pitch Content in Traditional Ghanaian Seperewa Songs

    Authors: Kelvin L Walls, Iran R Roman, Kelsey Van Ert, Colter Harper, Leila Adu-Gilmore

    Abstract: This study examines the pitch content in traditional Ghanaian seperewa (Akan harp-lute) songs, utilizing a unique dataset from field recordings of the mid-twentieth century. We selected 71 songs and used Demucs to isolate vocals from instrumental tracks. We then retrieved the F0 content from these isolated tracks and applied Gaussian Mixture Models (GMM) to approximate musical scales. Comparative F…

    Submitted 12 November, 2024; originally announced November 2024.

  8. arXiv:2409.02915  [pdf, other]

    cs.SD eess.AS

    Latent Watermarking of Audio Generative Models

    Authors: Robin San Roman, Pierre Fernandez, Antoine Deleforge, Yossi Adi, Romain Serizel

    Abstract: The advancements in audio generative models have opened up new challenges in their responsible disclosure and the detection of their misuse. In response, we introduce a method to watermark latent generative models by a specific watermarking of their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of…

    Submitted 4 September, 2024; originally announced September 2024.

  9. arXiv:2401.17264  [pdf, other]

    cs.SD cs.AI cs.CR

    Proactive Detection of Voice Cloning with Localized Watermarking

    Authors: Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar

    Abstract: In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized waterma…

    Submitted 6 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Published at ICML 2024. Code at https://github.com/facebookresearch/audioseal - webpage at https://pierrefdz.github.io/publications/audioseal/

  10. arXiv:2401.12238  [pdf, other]

    eess.AS cs.LG cs.SD

    Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms

    Authors: Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, Juan P. Bello

    Abstract: Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: 5 pages, 4 figures, 1 table, to be presented at ICASSP 2024 in Seoul, South Korea

  11. arXiv:2401.08717  [pdf, other]

    cs.SD eess.AS

    Robust DOA estimation using deep acoustic imaging

    Authors: Adrian S. Roman, Iran R. Roman, Juan P. Bello

    Abstract: Direction of arrival estimation (DoAE) aims at tracking a sound in azimuth and elevation. Recent advancements include data-driven models with inputs derived from ambisonics intensity vectors or correlations between channels in a microphone array. A spherical intensity map (SIM), or acoustic image, is an alternative input representation that remains underexplored. SIMs benefit from high-resolution…

    Submitted 15 January, 2024; originally announced January 2024.

  12. arXiv:2312.14036  [pdf, other]

    cs.SD eess.AS

    Total variation in popular rap vocals from 2009-2023: extension of the analysis by Georgieva, Ripolles & McFee

    Authors: Kelvin L Walls, Iran R Roman, Bea Steers, Elena Georgieva

    Abstract: Pitch variability in rap vocals is overlooked in favor of the genre's uniquely dynamic rhythmic properties. We present an analysis of fundamental frequency (F0) variation in rap vocals over the past 14 years, focusing on song examples that represent the state of modern rap music. Our analysis aims at identifying meaningful trends over time, and is in turn a continuation of the 2023 analysis by Geo…

    Submitted 21 December, 2023; originally announced December 2023.

    Journal ref: ISMIR 2023 Hybrid Conference, 2023 Nov 5

  13. arXiv:2312.05187  [pdf, other]

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4…

    Submitted 8 December, 2023; originally announced December 2023.

  14. arXiv:2310.00870  [pdf, other]

    cs.SD cs.IR eess.AS

    F0 analysis of Ghanaian pop singing reveals progressive alignment with equal temperament over the past three decades: a case study

    Authors: Iran R. Roman, Daniel Faronbi, Isabelle Burger-Weiser, Leila Adu-Gilmore

    Abstract: Contemporary Ghanaian popular singing combines European and traditional Ghanaian influences. We hypothesize that access to technology embedded with equal temperament catalyzed a progressive alignment of Ghanaian singing with equal-tempered scales over time. To test this, we study the Ghanaian singer Daddy Lumba, whose work spans from the earliest Ghanaian electronic style in the late 1980s to the…

    Submitted 1 October, 2023; originally announced October 2023.

    Comments: Pages 27-33

  15. arXiv:2309.09288  [pdf, other]

    cs.SD eess.AS

    Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions

    Authors: Saksham Singh Kushwaha, Iran R. Roman, Magdalena Fuentes, Juan Pablo Bello

    Abstract: Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing a…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: Accepted in WASPAA 2023

  16. arXiv:2308.02560  [pdf, other]

    cs.SD cs.LG eess.AS

    From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

    Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez

    Abstract: Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the condi…

    Submitted 8 November, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

    Comments: 10 pages

    Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  17. arXiv:2302.06018  [pdf, other]

    cs.GT cs.LG

    Optimizing Floors in First Price Auctions: an Empirical Study of Yahoo Advertising

    Authors: Miguel Alcobendas, Jonathan Ji, Hemakumar Gokulakannan, Dawit Wami, Boris Kapchits, Emilien Pouradier Duteil, Korby Satow, Maria Rosario Levy Roman, Oriol Diaz, Amado A. Diaz Jr., Rabi Kavoori

    Abstract: Floors (also known as reserve prices) help publishers to increase the expected revenue of their ad space, which is usually sold via auctions. Floors are defined as the minimum bid that a seller (it can be a publisher or an ad exchange) is willing to accept for the inventory opportunity. In this paper, we present a model to set floors in first price auctions, and discuss the impact of its implement…

    Submitted 9 February, 2024; v1 submitted 12 February, 2023; originally announced February 2023.

  18. arXiv:2110.09111  [pdf, other]

    cs.AI cs.LG cs.SI

    Analyzing Wikipedia Membership Dataset and Predicting Unconnected Nodes in the Signed Networks

    Authors: Zhihao Wu, Taoran Li, Ray Roman

    Abstract: In the age of digital interaction, person-to-person relationships existing on social media may be different from the very same interactions that exist offline. Examining potential or spurious relationships between members in a social network is a fertile area of research for computer scientists -- here we examine how relationships can be predicted between two unconnected people in a social network…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: The work was done in UCLA CS249 17Spring

  19. arXiv:2110.05948  [pdf, other]

    eess.SP cs.AI cs.CV cs.GR cs.LG cs.SD eess.AS eess.IV

    Denoising Diffusion Gamma Models

    Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

    Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.07582

  20. arXiv:2109.12690  [pdf, ps, other]

    cs.SD cs.DB cs.LG eess.AS

    Soundata: A Python library for reproducible use of audio datasets

    Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

    Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, valid…

    Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

  21. arXiv:2106.07582  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    Non Gaussian Denoising Diffusion Models

    Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

    Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could help the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion pro…

    Submitted 14 June, 2021; originally announced June 2021.

  22. arXiv:2005.00728  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.RO

    RMM: A Recursive Mental Model for Dialog Navigation

    Authors: Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

    Abstract: Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models…

    Submitted 5 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Findings of Empirical Methods in Natural Language Processing (EMNLP Findings), 2020

  23. Mobile Edge Computing, Fog et al.: A Survey and Analysis of Security Threats and Challenges

    Authors: Rodrigo Roman, Javier Lopez, Masahiro Mambo

    Abstract: For various reasons, the cloud computing paradigm is unable to meet certain requirements (e.g. low latency and jitter, context awareness, mobility support) that are crucial for several applications (e.g. vehicular networks, augmented reality). To fulfil these requirements, various paradigms, such as fog computing, mobile edge computing, and mobile cloud computing, have emerged in recent years. Whi…

    Submitted 14 November, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

    Comments: In press, accepted manuscript: Future Generation Computer Systems

    ACM Class: A.1; D.4.6; C.2.4