Search | arXiv e-print repository

arXiv:2505.23509 [pdf, other]

Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds

Authors: Andrew Chang, Yike Li, Iran R. Roman, David Poeppel

Abstract: Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex… ▽ More Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds, which are essential categories for both human cognition and machine perception. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for basic understanding of speech and auditory sciences, as well as developing audio BCI and cognitive computing. △ Less

Submitted 29 May, 2025; originally announced May 2025.

Comments: Interspeech 2025

arXiv:2504.02988 [pdf, ps, other]

Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

Authors: Adrian S. Roman, Aiden Chang, Gerardo Meza, Iran R. Roman

Abstract: We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360 synthetic videos where objects move matching synthetic SELD audio data and its annotat… ▽ More We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360 synthetic videos where objects move matching synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9deg LE). We open-source our data generation tool for maximal use by members of the SELD research community. △ Less

Submitted 3 April, 2025; originally announced April 2025.

arXiv:2504.02197 [pdf, other]

doi 10.1109/MCG.2025.3549696

Design and Implementation of the Transparent, Interpretable, and Multimodal (TIM) AR Personal Assistant

Authors: Erin McGowan, Joao Rulff, Sonia Castelo, Guande Wu, Shaoyu Chen, Roque Lopez, Bea Steers, Iran R. Roman, Fabio F. Dias, Jing Qian, Parikshit Solunke, Michael Middleton, Ryan McKendrick, Claudio T. Silva

Abstract: The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be re… ▽ More The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be readily available for post-hoc analysis to enable developers to understand performer behavior and quickly detect failures. We introduce TIM, the first end-to-end AI-enabled task guidance system in augmented reality which is capable of detecting both the user and scene as well as providing adaptable, just-in-time feedback. We discuss the system challenges and propose design solutions. We also demonstrate how TIM adapts to domain applications with varying needs, highlighting how the system components can be customized for each scenario. △ Less

Submitted 2 April, 2025; originally announced April 2025.

Comments: Copyright 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. Article accepted for publication in IEEE Computer Graphics and Applications. This is the author's version, content may change prior to final publication

arXiv:2501.03720 [pdf, other]

Guitar-TECHS: An Electric Guitar Dataset Covering Techniques, Musical Excerpts, Chords and Scales Using a Diverse Array of Hardware

Authors: Hegel Pedroza, Wallace Abreu, Ryan M. Corey, Iran R. Roman

Abstract: Guitar-related machine listening research involves tasks like timbre transfer, performance generation, and automatic transcription. However, small datasets often limit model robustness due to insufficient acoustic diversity and musical content. To address these issues, we introduce Guitar-TECHS, a comprehensive dataset featuring a variety of guitar techniques, musical excerpts, chords, and scales.… ▽ More Guitar-related machine listening research involves tasks like timbre transfer, performance generation, and automatic transcription. However, small datasets often limit model robustness due to insufficient acoustic diversity and musical content. To address these issues, we introduce Guitar-TECHS, a comprehensive dataset featuring a variety of guitar techniques, musical excerpts, chords, and scales. These elements are performed by diverse musicians across various recording settings. Guitar-TECHS incorporates recordings from two stereo microphones: an egocentric microphone positioned on the performer's head and an exocentric microphone placed in front of the performer. It also includes direct input recordings and microphoned amplifier outputs, offering a wide spectrum of audio inputs and recording qualities. All signals and MIDI labels are properly synchronized. Its multi-perspective and multi-modal content makes Guitar-TECHS a valuable resource for advancing data-driven guitar research, and to develop robust guitar listening algorithms. We provide empirical data to demonstrate the dataset's effectiveness in training robust models for Guitar Tablature Transcription. △ Less

Submitted 7 January, 2025; originally announced January 2025.

Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

arXiv:2501.01757 [pdf, other]

MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling

Authors: Simon Rouard, Robin San Roman, Yossi Adi, Axel Roebel

Abstract: While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with 3 stems (bass, drums and other) that learn the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music… ▽ More While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with 3 stems (bass, drums and other) that learn the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music source separation to train a multi-stream text-to-music language model on a large dataset. Finally, thanks to a particular conditioning method, our model is able to edit bass, drums or other stems on existing or generated songs as well as doing iterative composition (e.g. generating bass on top of existing drums). This gives more flexibility in music generation algorithms and it is to the best of our knowledge the first open-source multi-stem autoregressive music generation model that can perform good quality generation and coherent source editing. Code and model weights will be released and samples are available on https://simonrouard.github.io/musicgenstem/. △ Less

Submitted 7 January, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

Comments: 5 pages, 3 figures, accepted to ICASSP 2025

arXiv:2412.08821 [pdf, other]

Large Concept Models: Language Modeling in a Sentence Representation Space

Authors: LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk

Abstract: LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper,… ▽ More LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available. △ Less

Submitted 15 December, 2024; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: 49 pages

arXiv:2411.08234 [pdf, other]

Analyzing Pitch Content in Traditional Ghanaian Seperewa Songs

Authors: Kelvin L Walls, Iran R Roman, Kelsey Van Ert, Colter Harper, Leila Adu-Gilmore

Abstract: This study examines the pitch content in traditional Ghanaian seperewa (Akan harp-lute) songs, utilizing a unique dataset from field recordings of the mid-twentieth century. We selected 71 songs and used Demucs to isolate vocals from instrumental tracks. We then retrieved the F0 content from these isolated tacks and applied Gaussian Mixture Models (GMM) to approximate musical scales. Comparative F… ▽ More This study examines the pitch content in traditional Ghanaian seperewa (Akan harp-lute) songs, utilizing a unique dataset from field recordings of the mid-twentieth century. We selected 71 songs and used Demucs to isolate vocals from instrumental tracks. We then retrieved the F0 content from these isolated tacks and applied Gaussian Mixture Models (GMM) to approximate musical scales. Comparative F0 analysis between vocals and seperewa revealed higher microtonal deviations from equal temperament in vocal tracks. We also note challenges in using MIR tools for musical scale approximation in non-Western music. Our research contributes to the quantitative study of pitch in traditional music of Sub-Saharan Africa. △ Less

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2409.02915 [pdf, other]

Latent Watermarking of Audio Generative Models

Authors: Robin San Roman, Pierre Fernandez, Antoine Deleforge, Yossi Adi, Romain Serizel

Abstract: The advancements in audio generative models have opened up new challenges in their responsible disclosure and the detection of their misuse. In response, we introduce a method to watermark latent generative models by a specific watermarking of their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of… ▽ More The advancements in audio generative models have opened up new challenges in their responsible disclosure and the detection of their misuse. In response, we introduce a method to watermark latent generative models by a specific watermarking of their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of the decoding method used. This approach enables the detection of the generated content without the need for a post-hoc watermarking step. It provides a more secure solution for open-sourced models and facilitates the identification of derivative works that fine-tune or use these models without adhering to their license terms. Our results indicate for instance that generated outputs are detected with an accuracy of more than 75% at a false positive rate of $10^{-3}$, even after fine-tuning the latent generative model. △ Less

Submitted 4 September, 2024; originally announced September 2024.

arXiv:2401.17264 [pdf, other]

Proactive Detection of Voice Cloning with Localized Watermarking

Authors: Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar

Abstract: In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized waterma… ▽ More In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level, and a novel perceptual loss inspired by auditory masking, that enables AudioSeal to achieve better imperceptibility. AudioSeal achieves state-of-the-art performance in terms of robustness to real life audio manipulations and imperceptibility based on automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector, that significantly surpasses existing models in speed - achieving detection up to two orders of magnitude faster, making it ideal for large-scale and real-time applications. △ Less

Submitted 6 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Published at ICML 2024. Code at https://github.com/facebookresearch/audioseal - webpage at https://pierrefdz.github.io/publications/audioseal/

arXiv:2401.12238 [pdf, other]

Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms

Authors: Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, Juan P. Bello

Abstract: Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatialy-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific… ▽ More Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatialy-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improves as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 1 table, to be presented at ICASSP 2024 in Seoul, South Korea

arXiv:2401.08717 [pdf, other]

Robust DOA estimation using deep acoustic imaging

Authors: Adrian S. Roman, Iran R. Roman, Juan P. Bello

Abstract: Direction of arrival estimation (DoAE) aims at tracking a sound in azimuth and elevation. Recent advancements include data-driven models with inputs derived from ambisonics intensity vectors or correlations between channels in a microphone array. A spherical intensity map (SIM), or acoustic image, is an alternative input representation that remains underexplored. SIMs benefit from high-resolution… ▽ More Direction of arrival estimation (DoAE) aims at tracking a sound in azimuth and elevation. Recent advancements include data-driven models with inputs derived from ambisonics intensity vectors or correlations between channels in a microphone array. A spherical intensity map (SIM), or acoustic image, is an alternative input representation that remains underexplored. SIMs benefit from high-resolution microphone arrays, yet most DoAE datasets use low-resolution ones. Therefore, we first propose a super-resolution method to upsample low-resolution microphones. Next, we benchmark DoAE models that use SIMs as input. We arrive to a model that uses SIMs for DoAE estimation and outperforms a baseline and a state-of-the-art model. Our study highlights the relevance of acoustic imaging for DoAE tasks. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2312.14036 [pdf, other]

Total variation in popular rap vocals from 2009-2023: extension of the analysis by Georgieva, Ripolles & McFee

Authors: Kelvin L Walls, Iran R Roman, Bea Steers, Elena Georgieva

Abstract: Pitch variability in rap vocals is overlooked in favor of the genre's uniquely dynamic rhythmic properties. We present an analysis of fundamental frequency (F0) variation in rap vocals over the past 14 years, focusing on song examples that represent the state of modern rap music. Our analysis aims at identifying meaningful trends over time, and is in turn a continuation of the 2023 analysis by Geo… ▽ More Pitch variability in rap vocals is overlooked in favor of the genre's uniquely dynamic rhythmic properties. We present an analysis of fundamental frequency (F0) variation in rap vocals over the past 14 years, focusing on song examples that represent the state of modern rap music. Our analysis aims at identifying meaningful trends over time, and is in turn a continuation of the 2023 analysis by Georgieva, Ripolles & McFee. They found rap to be an outlier with larger F0 variation compared to other genres, but with a declining trend since the genre's inception. However, they only analyzed data through 2010. Our analysis looks beyond 2010. We once again observe rap's large F0 variation, but with a decelerated decline in recent years. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Journal ref: Ismir 2023 Hybrid Conference 2023 Nov 5

arXiv:2312.05187 [pdf, other]

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek , et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4… ▽ More Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2310.00870 [pdf, other]

doi 10.5281/zenodo.8136568

F0 analysis of Ghanaian pop singing reveals progressive alignment with equal temperament over the past three decades: a case study

Authors: Iran R. Roman, Daniel Faronbi, Isabelle Burger-Weiser, Leila Adu-Gilmore

Abstract: Contemporary Ghanaian popular singing combines European and traditional Ghanaian influences. We hypothesize that access to technology embedded with equal temperament catalyzed a progressive alignment of Ghanaian singing with equal-tempered scales over time. To test this, we study the Ghanaian singer Daddy Lumba, whose work spans from the earliest Ghanaian electronic style in the late 1980s to the… ▽ More Contemporary Ghanaian popular singing combines European and traditional Ghanaian influences. We hypothesize that access to technology embedded with equal temperament catalyzed a progressive alignment of Ghanaian singing with equal-tempered scales over time. To test this, we study the Ghanaian singer Daddy Lumba, whose work spans from the earliest Ghanaian electronic style in the late 1980s to the present. Studying a singular musician as a case study allows us to refine our analysis without over-interpreting the findings. We curated a collection of his songs, distributed between 1989 and 2016, to extract F0 values from isolated vocals. We used Gaussian mixture modeling (GMM) to approximate each song's scale and found that the pitch variance has been decreasing over time. We also determined whether the GMM components follow the arithmetic relationships observed in equal-tempered scales, and observed that Daddy Lumba's singing better aligns with equal temperament in recent years. Together, results reveal the impact of exposure to equal-tempered scales, resulting in lessened microtonal content in Daddy Lumba's singing. Our study highlights a potential vulnerability of Ghanaian musical scales and implies a need for research that maps and archives singing styles. △ Less

Submitted 1 October, 2023; originally announced October 2023.

Comments: Pages 27-33

arXiv:2309.09288 [pdf, other]

Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions

Authors: Saksham Singh Kushwaha, Iran R. Roman, Magdalena Fuentes, Juan Pablo Bello

Abstract: Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing a… ▽ More Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings by non-coincident microphones to use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently-published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals optimal training using a loss that weighs model errors as an inverse function of the sound source true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning. △ Less

Submitted 17 September, 2023; originally announced September 2023.

Comments: Accepted in WASPAA 2023

arXiv:2308.02560 [pdf, other]

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez

Abstract: Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the condi… ▽ More Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and, evaluation code, along with audio samples, are available on the facebookresearch/audiocraft Github page. △ Less

Submitted 8 November, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: 10 pages

Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

arXiv:2302.06018 [pdf, other]

Optimizing Floors in First Price Auctions: an Empirical Study of Yahoo Advertising

Authors: Miguel Alcobendas, Jonathan Ji, Hemakumar Gokulakannan, Dawit Wami, Boris Kapchits, Emilien Pouradier Duteil, Korby Satow, Maria Rosario Levy Roman, Oriol Diaz, Amado A. Diaz Jr., Rabi Kavoori

Abstract: Floors (also known as reserve prices) help publishers to increase the expected revenue of their ad space, which is usually sold via auctions. Floors are defined as the minimum bid that a seller (it can be a publisher or an ad exchange) is willing to accept for the inventory opportunity. In this paper, we present a model to set floors in first price auctions, and discuss the impact of its implement… ▽ More Floors (also known as reserve prices) help publishers to increase the expected revenue of their ad space, which is usually sold via auctions. Floors are defined as the minimum bid that a seller (it can be a publisher or an ad exchange) is willing to accept for the inventory opportunity. In this paper, we present a model to set floors in first price auctions, and discuss the impact of its implementation on Yahoo sites. The model captures important characteristics of the online advertising industry. For instance, some bidders impose restrictions on how ad exchanges can handle data from bidders, conditioning the model choice to set reserve prices. Our solution induces bidders to change their bidding behavior as a response to the floors enclosed in the bid request, helping online publishers to increase their ad revenue. The outlined methodology has been implemented at Yahoo with remarkable results. The annualized incremental revenue is estimated at +1.3% on Yahoo display inventory, and +2.5% on video ad inventory. These are non-negligible numbers in the multi-million Yahoo ad business. △ Less

Submitted 9 February, 2024; v1 submitted 12 February, 2023; originally announced February 2023.

arXiv:2110.09111 [pdf, other]

Analyzing Wikipedia Membership Dataset and PredictingUnconnected Nodes in the Signed Networks

Authors: Zhihao Wu, Taoran Li, Ray Roman

Abstract: In the age of digital interaction, person-to-person relationships existing on social media may be different from the very same interactions that exist offline. Examining potential or spurious relationships between members in a social network is a fertile area of research for computer scientists -- here we examine how relationships can be predicted between two unconnected people in a social network… ▽ More In the age of digital interaction, person-to-person relationships existing on social media may be different from the very same interactions that exist offline. Examining potential or spurious relationships between members in a social network is a fertile area of research for computer scientists -- here we examine how relationships can be predicted between two unconnected people in a social network by using area under Precison-Recall curve and ROC. Modeling the social network as a signed graph, we compare Triadic model,Latent Information model and Sentiment model and use them to predict peer to peer interactions, first using a plain signed network, and second using a signed network with comments as context. We see that our models are much better than random model and could complement each other in different cases. △ Less

Submitted 18 October, 2021; originally announced October 2021.

Comments: The work was done in UCLA CS249 17Spring

arXiv:2110.05948 [pdf, other]

Denoising Diffusion Gamma Models

Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion… ▽ More Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we introduce the Denoising Diffusion Gamma Model (DDGM) and show that noise from Gamma distribution provides improved results for image and speech generation. Our approach preserves the ability to efficiently sample state in the training diffusion process while using Gamma noise. △ Less

Submitted 10 October, 2021; originally announced October 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2106.07582

arXiv:2109.12690 [pdf, ps, other]

Soundata: A Python library for reproducible use of audio datasets

Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, valid… ▽ More Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based and inspired on mirdata and design to complement mirdata by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way. △ Less

Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

arXiv:2106.07582 [pdf, other]

Non Gaussian Denoising Diffusion Models

Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underline noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom, could help the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion pro… ▽ More Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underline noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom, could help the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we show that noise from Gamma distribution provides improved results for image and speech generation. Moreover, we show that using a mixture of Gaussian noise variables in the diffusion process improves the performance over a diffusion process that is based on a single distribution. Our approach preserves the ability to efficiently sample state in the training diffusion process while using Gamma noise and a mixture of noise. △ Less

Submitted 14 June, 2021; originally announced June 2021.

arXiv:2005.00728 [pdf, other]

RMM: A Recursive Mental Model for Dialog Navigation

Authors: Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

Abstract: Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models… ▽ More Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models the guiding agent to simulate answers given candidate generated questions. The guiding agent in turn models the navigating agent to simulate navigation steps it would take to generate answers. We use the progress agents make towards the goal as a reinforcement learning reward signal to directly inform not only navigation actions, but also both question and answer generation. We demonstrate that RMM enables better generalization to novel environments. Interlocutor modelling may be a way forward for human-agent dialogue where robots need to both ask and answer questions. △ Less

Submitted 5 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

Comments: Findings of Empirical Methods in Natural Language Processing (EMNLP Findings), 2020

arXiv:1602.00484 [pdf, other]

doi 10.1016/j.future.2016.11.009

Mobile Edge Computing, Fog et al.: A Survey and Analysis of Security Threats and Challenges

Authors: Rodrigo Roman, Javier Lopez, Masahiro Mambo

Abstract: For various reasons, the cloud computing paradigm is unable to meet certain requirements (e.g. low latency and jitter, context awareness, mobility support) that are crucial for several applications (e.g. vehicular networks, augmented reality). To fulfil these requirements, various paradigms, such as fog computing, mobile edge computing, and mobile cloud computing, have emerged in recent years. Whi… ▽ More For various reasons, the cloud computing paradigm is unable to meet certain requirements (e.g. low latency and jitter, context awareness, mobility support) that are crucial for several applications (e.g. vehicular networks, augmented reality). To fulfil these requirements, various paradigms, such as fog computing, mobile edge computing, and mobile cloud computing, have emerged in recent years. While these edge paradigms share several features, most of the existing research is compartmentalised; no synergies have been explored. This is especially true in the field of security, where most analyses focus only on one edge paradigm, while ignoring the others. The main goal of this study is to holistically analyse the security threats, challenges, and mechanisms inherent in all edge paradigms, while highlighting potential synergies and venues of collaboration. In our results, we will show that all edge paradigms should consider the advances in other paradigms. △ Less

Submitted 14 November, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

Comments: In press, accepted manuscript: Future Generation Computer Systems

ACM Class: A.1; D.4.6; C.2.4

Showing 1–23 of 23 results for author: Roman, R