-
The Llama 3 Herd of Models
Authors:
Aaron Grattafiori,
Abhimanyu Dubey,
Abhinav Jauhri,
Abhinav Pandey,
Abhishek Kadian,
Ahmad Al-Dahle,
Aiesha Letman,
Akhil Mathur,
Alan Schelten,
Alex Vaughan,
Amy Yang,
Angela Fan,
Anirudh Goyal,
Anthony Hartshorn,
Aobo Yang,
Archi Mitra,
Archie Sravankumar,
Artem Korenev,
Arthur Hinsvark,
Arun Rao,
Aston Zhang,
Aurelien Rodriguez,
Austen Gregerson,
Ava Spataru,
Baptiste Roziere
, et al. (536 additional authors not shown)
Abstract:
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…
▽ More
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
△ Less
Submitted 23 November, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
COVID-19 Twitter Sentiment Classification Using Hybrid Deep Learning Model Based on Grid Search Methodology
Authors:
Jitendra Tembhurne,
Anant Agrawal,
Kirtan Lakhotia
Abstract:
In the contemporary era, social media platforms amass an extensive volume of social data contributed by their users. In order to promptly grasp the opinions and emotional inclinations of individuals regarding a product or event, it becomes imperative to perform sentiment analysis on the user-generated content. Microblog comments often encompass both lengthy and concise text entries, presenting a c…
▽ More
In the contemporary era, social media platforms amass an extensive volume of social data contributed by their users. In order to promptly grasp the opinions and emotional inclinations of individuals regarding a product or event, it becomes imperative to perform sentiment analysis on the user-generated content. Microblog comments often encompass both lengthy and concise text entries, presenting a complex scenario. This complexity is particularly pronounced in extensive textual content due to its rich content and intricate word interrelations compared to shorter text entries. Sentiment analysis of public opinion shared on social networking websites such as Facebook or Twitter has evolved and found diverse applications. However, several challenges remain to be tackled in this field. The hybrid methodologies have emerged as promising models for mitigating sentiment analysis errors, particularly when dealing with progressively intricate training data. In this article, to investigate the hesitancy of COVID-19 vaccination, we propose eight different hybrid deep learning models for sentiment classification with an aim of improving overall accuracy of the model. The sentiment prediction is achieved using embedding, deep learning model and grid search algorithm on Twitter COVID-19 dataset. According to the study, public sentiment towards COVID-19 immunization appears to be improving with time, as evidenced by the gradual decline in vaccine reluctance. Through extensive evaluation, proposed model reported an increased accuracy of 98.86%, outperforming other models. Specifically, the combination of BERT, CNN and GS yield the highest accuracy, while the combination of GloVe, BiLSTM, CNN and GS follows closely behind with an accuracy of 98.17%. In addition, increase in accuracy in the range of 2.11% to 14.46% is reported by the proposed model in comparisons with existing works.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
A Large-Scale Evaluation of Speech Foundation Models
Authors:
Shu-wen Yang,
Heng-Jui Chang,
Zili Huang,
Andy T. Liu,
Cheng-I Lai,
Haibin Wu,
Jiatong Shi,
Xuankai Chang,
Hsiang-Sheng Tsai,
Wen-Chin Huang,
Tzu-hsun Feng,
Po-Han Chi,
Yist Y. Lin,
Yung-Sung Chuang,
Tzu-Hsien Huang,
Wei-Cheng Tseng,
Kushal Lakhotia,
Shang-Wen Li,
Abdelrahman Mohamed,
Shinji Watanabe,
Hung-yi Lee
Abstract:
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,…
▽ More
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.
△ Less
Submitted 29 May, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Edge-Disjoint Spanning Trees on Star-Product Networks
Authors:
Kelly Isham,
Laura Monroe,
Kartik Lakhotia,
Aleyah Dawkins,
Daniel Hwang,
Ales Kubicek
Abstract:
A star-product operation may be used to create large graphs from smaller factor graphs. Network topologies based on star-products demonstrate several advantages including low-diameter, high scalability, modularity and others. Many state-of-the-art diameter-2 and -3 topologies~(Slim Fly, Bundlefly, PolarStar etc.) can be represented as star products.
In this paper, we explore constructions of edg…
▽ More
A star-product operation may be used to create large graphs from smaller factor graphs. Network topologies based on star-products demonstrate several advantages including low-diameter, high scalability, modularity and others. Many state-of-the-art diameter-2 and -3 topologies~(Slim Fly, Bundlefly, PolarStar etc.) can be represented as star products.
In this paper, we explore constructions of edge-disjoint spanning trees~(EDSTs) in star-product topologies. EDSTs expose multiple parallel disjoint pathways in the network and can be leveraged to accelerate collective communication, enhance fault tolerance and network recovery, and manage congestion.
Our EDSTs have provably maximum or near-maximum cardinality which amplifies their benefits. We further analyze their depths and show that for one of our constructions, all trees have order of the depth of the EDSTs of the factor graphs, and for all other constructions, a large subset of the trees have that depth.
△ Less
Submitted 14 May, 2025; v1 submitted 18 March, 2024;
originally announced March 2024.
-
A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network
Authors:
Nils Blach,
Maciej Besta,
Daniele De Sensi,
Jens Domke,
Hussein Harake,
Shigang Li,
Patrick Iff,
Marek Konieczny,
Kartik Lakhotia,
Ales Kubicek,
Marcel Ferrari,
Fabrizio Petrini,
Torsten Hoefler
Abstract:
Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully a…
▽ More
Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect.
△ Less
Submitted 21 April, 2024; v1 submitted 5 October, 2023;
originally announced October 2023.
-
PolarStar: Expanding the Scalability Horizon of Diameter-3 Networks
Authors:
Kartik Lakhotia,
Laura Monroe,
Kelly Isham,
Maciej Besta,
Nils Blach,
Torsten Hoefler,
Fabrizio Petrini
Abstract:
We present PolarStar, a novel family of diameter-3 network topologies derived from the star product of low-diameter factor graphs. PolarStar gives the largest known diameter-3 network topologies for almost all radixes, thus providing the best known scalable diameter-$3$ network. Compared to current state-of-the-art diameter-$3$ networks, PolarStar achieves $1.3\times$ geometric mean increase in sc…
▽ More
We present PolarStar, a novel family of diameter-3 network topologies derived from the star product of low-diameter factor graphs. PolarStar gives the largest known diameter-3 network topologies for almost all radixes, thus providing the best known scalable diameter-$3$ network. Compared to current state-of-the-art diameter-$3$ networks, PolarStar achieves $1.3\times$ geometric mean increase in scale over Bundlefly, $1.9\times$ over Dragonfly, and $6.7\times$ over {3-D} HyperX. PolarStar has many other desirable properties, including a modular layout, large bisection, high resilience to link failures and a large number of feasible configurations for every radix. We give a detailed evaluation with simulations of synthetic and real-world traffic patterns and show that PolarStar exhibits comparable or better performance than current diameter-3 networks.
△ Less
Submitted 6 August, 2024; v1 submitted 14 February, 2023;
originally announced February 2023.
-
PolarFly: A Cost-Effective and Flexible Low-Diameter Topology
Authors:
Kartik Lakhotia,
Maciej Besta,
Laura Monroe,
Kelly Isham,
Patrick Iff,
Torsten Hoefler,
Fabrizio Petrini
Abstract:
In this paper we present PolarFly, a diameter-2 network topology based on the Erdos-Renyi family of polarity graphs from finite geometry. This is a highly scalable low-diameter topology that asymptotically reaches the Moore bound on the number of nodes for a given network degree and diameter
PolarFly achieves high Moore bound efficiency even for the moderate radixes commonly seen in current and…
▽ More
In this paper we present PolarFly, a diameter-2 network topology based on the Erdos-Renyi family of polarity graphs from finite geometry. This is a highly scalable low-diameter topology that asymptotically reaches the Moore bound on the number of nodes for a given network degree and diameter
PolarFly achieves high Moore bound efficiency even for the moderate radixes commonly seen in current and near-future routers, reaching more than 96% of the theoretical peak. It also offers more feasible router degrees than the state-of-the-art solutions, greatly adding to the selection of scalable diameter-2 networks. PolarFly enjoys many other topological properties highly relevant in practice, such as a modular design and expandability that allow incremental growth in network size without rewiring the whole network. Our evaluation shows that PolarFly outperforms competitive networks in terms of scalability, cost and performance for various traffic patterns.
△ Less
Submitted 2 May, 2023; v1 submitted 2 August, 2022;
originally announced August 2022.
-
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
Authors:
Hsiang-Sheng Tsai,
Heng-Jui Chang,
Wen-Chin Huang,
Zili Huang,
Kushal Lakhotia,
Shu-wen Yang,
Shuyan Dong,
Andy T. Liu,
Cheng-I Jeff Lai,
Jiatong Shi,
Xuankai Chang,
Phil Hall,
Hsuan-Jui Chen,
Shang-Wen Li,
Shinji Watanabe,
Abdelrahman Mohamed,
Hung-yi Lee
Abstract:
Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards in…
▽ More
Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
△ Less
Submitted 14 March, 2022;
originally announced March 2022.
-
textless-lib: a Library for Textless Spoken Language Processing
Authors:
Eugene Kharitonov,
Jade Copet,
Kushal Lakhotia,
Tu Anh Nguyen,
Paden Tomasello,
Ann Lee,
Ali Elkahky,
Wei-Ning Hsu,
Abdelrahman Mohamed,
Emmanuel Dupoux,
Yossi Adi
Abstract:
Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discuss three differ…
▽ More
Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discuss three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research the textless setting and will be handful not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at https://github.com/facebookresearch/textlesslib/ .
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Authors:
Bowen Shi,
Wei-Ning Hsu,
Kushal Lakhotia,
Abdelrahman Mohamed
Abstract:
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovere…
▽ More
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
△ Less
Submitted 12 March, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Authors:
Arun Babu,
Changhan Wang,
Andros Tjandra,
Kushal Lakhotia,
Qiantong Xu,
Naman Goyal,
Kritika Singh,
Patrick von Platen,
Yatharth Saraf,
Juan Pino,
Alexei Baevski,
Alexis Conneau,
Michael Auli
Abstract:
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b…
▽ More
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
△ Less
Submitted 16 December, 2021; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Evaluating robustness of You Only Hear Once(YOHO) Algorithm on noisy audios in the VOICe Dataset
Authors:
Soham Tiwari,
Kshitiz Lakhotia,
Manjunath Mulimani
Abstract:
Sound event detection (SED) in machine listening entails identifying the different sounds in an audio file and identifying the start and end time of a particular sound event in the audio. SED finds use in various applications such as audio surveillance, speech recognition, and context-based indexing and retrieval of data in a multimedia database. However, in real-life scenarios, the audios from va…
▽ More
Sound event detection (SED) in machine listening entails identifying the different sounds in an audio file and identifying the start and end time of a particular sound event in the audio. SED finds use in various applications such as audio surveillance, speech recognition, and context-based indexing and retrieval of data in a multimedia database. However, in real-life scenarios, the audios from various sources are seldom devoid of any interfering noise or disturbance. In this paper, we test the performance of the You Only Hear Once (YOHO) algorithm on noisy audio data. Inspired by the You Only Look Once (YOLO) algorithm in computer vision, the YOHO algorithm can match the performance of the various state-of-the-art algorithms on datasets such as Music Speech Detection Dataset, TUT Sound Event, and Urban-SED datasets but at lower inference times. In this paper, we explore the performance of the YOHO algorithm on the VOICe dataset containing audio files with noise at different sound-to-noise ratios (SNR). YOHO could outperform or at least match the best performing SED algorithms reported in the VOICe dataset paper and make inferences in less time.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery
Authors:
Kartik Lakhotia,
Rajgopal Kannan,
Viktor Prasanna
Abstract:
Wing and Tip decomposition construct a hierarchy of butterfly-dense edge and vertex induced bipartite subgraphs, respectively. They have applications in several domains including e-commerce, recommendation systems and document analysis. Existing decomposition algorithms use a bottom-up approach that constructs the hierarchy in an increasing order of subgraph density. They iteratively peel the enti…
▽ More
Wing and Tip decomposition construct a hierarchy of butterfly-dense edge and vertex induced bipartite subgraphs, respectively. They have applications in several domains including e-commerce, recommendation systems and document analysis. Existing decomposition algorithms use a bottom-up approach that constructs the hierarchy in an increasing order of subgraph density. They iteratively peel the entities with minimum butterfly count i.e. remove them from the graph and update the butterfly count of other entities. However, the amount of butterflies in large bipartite graphs makes bottom-up peeling computationally demanding. Furthermore, the strict order of peeling entities results in a numerous sequentially dependent iterations, which makes parallelization challenging.
In this paper, we propose a novel Parallel Bipartite Network peelinG (PBNG) framework which adopts a two-phased peeling approach to relax the order of peeling, and in turn, reduce synchronization. The first phase divides the decomposition hierarchy into several partitions using very few peeling iterations. The second phase concurrently processes these partitions to generate the final hierarchy, and requires no global synchronization. The two-phased peeling further enables batching optimizations that dramatically improve the computational efficiency.
We empirically evaluate PBNG using several real-world bipartite graphs. Compared to the state-of-the-art frameworks and decomposition algorithms, PBNG achieves up to four orders of magnitude reduction in synchronization and two orders of magnitude speedup, respectively. We also present the first decomposition results of some of the largest public real-world datasets, which PBNG can peel in few minutes but existing algorithms fail to process even in several days.
△ Less
Submitted 24 October, 2021;
originally announced October 2021.
-
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
Authors:
Xilun Chen,
Kushal Lakhotia,
Barlas OÄŸuz,
Anchit Gupta,
Patrick Lewis,
Stan Peshterliev,
Yashar Mehdad,
Sonal Gupta,
Wen-tau Yih
Abstract:
Despite their recent popularity and well-known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query and to generalize to out-of-domain data. It has been argued that this is an inherent limitation of dense models. We rebut this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dens…
▽ More
Despite their recent popularity and well-known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query and to generalize to out-of-domain data. It has been argued that this is an inherent limitation of dense models. We rebut this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dense retriever with the lexical matching capacity of a sparse model. We show that a dense Lexical Model Λ can be trained to imitate a sparse one, and SPAR is built by augmenting a standard dense retriever with Λ. Empirically, SPAR shows superior performance on a range of tasks including five question answering datasets, MS MARCO passage retrieval, as well as the EntityQuestions and BEIR benchmarks for out-of-domain evaluation, exceeding the performance of state-of-the-art dense and sparse retrievers. The code and models of SPAR are available at: https://github.com/facebookresearch/dpr-scale/tree/main/spar
△ Less
Submitted 11 November, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Text-Free Prosody-Aware Generative Spoken Language Modeling
Authors:
Eugene Kharitonov,
Ann Lee,
Adam Polyak,
Yossi Adi,
Jade Copet,
Kushal Lakhotia,
Tu-Anh Nguyen,
Morgane Rivière,
Abdelrahman Mohamed,
Emmanuel Dupoux,
Wei-Ning Hsu
Abstract:
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-lik…
▽ More
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Codes and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.
△ Less
Submitted 10 May, 2022; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Domain-matched Pre-training Tasks for Dense Retrieval
Authors:
Barlas OÄŸuz,
Kushal Lakhotia,
Anchit Gupta,
Patrick Lewis,
Vladimir Karpukhin,
Aleksandra Piktus,
Xilun Chen,
Sebastian Riedel,
Wen-tau Yih,
Sonal Gupta,
Yashar Mehdad
Abstract:
Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder m…
▽ More
Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
△ Less
Submitted 28 July, 2021;
originally announced July 2021.
-
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Authors:
Wei-Ning Hsu,
Benjamin Bolte,
Yao-Hung Hubert Tsai,
Kushal Lakhotia,
Ruslan Salakhutdinov,
Abdelrahman Mohamed
Abstract:
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for…
▽ More
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
△ Less
Submitted 14 June, 2021;
originally announced June 2021.
-
SUPERB: Speech processing Universal PERformance Benchmark
Authors:
Shu-wen Yang,
Po-Han Chi,
Yung-Sung Chuang,
Cheng-I Jeff Lai,
Kushal Lakhotia,
Yist Y. Lin,
Andy T. Liu,
Jiatong Shi,
Xuankai Chang,
Guan-Ting Lin,
Tzu-Hsien Huang,
Wei-Cheng Tseng,
Ko-tik Lee,
Da-Rong Liu,
Zili Huang,
Shuyan Dong,
Shang-Wen Li,
Shinji Watanabe,
Abdelrahman Mohamed,
Hung-yi Lee
Abstract:
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge…
▽ More
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.
△ Less
Submitted 15 October, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Authors:
Adam Polyak,
Yossi Adi,
Jade Copet,
Eugene Kharitonov,
Kushal Lakhotia,
Wei-Ning Hsu,
Abdelrahman Mohamed,
Emmanuel Dupoux
Abstract:
We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and she…
▽ More
We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: speechbot.github.io/resynthesis.
△ Less
Submitted 27 July, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Generative Spoken Language Modeling from Raw Audio
Authors:
Kushal Lakhotia,
Evgeny Kharitonov,
Wei-Ning Hsu,
Yossi Adi,
Adam Polyak,
Benjamin Bolte,
Tu-Anh Nguyen,
Jade Copet,
Alexei Baevski,
Adelrahman Mohamed,
Emmanuel Dupoux
Abstract:
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text u…
▽ More
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
△ Less
Submitted 9 September, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation
Authors:
Kushal Lakhotia,
Bhargavi Paranjape,
Asish Ghoshal,
Wen-tau Yih,
Yashar Mehdad,
Srinivasan Iyer
Abstract:
Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for NLP tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence to sequence (seq2seq) models have proven to be very effective in jointly making predictions, as well as generating NL explanations.…
▽ More
Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for NLP tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence to sequence (seq2seq) models have proven to be very effective in jointly making predictions, as well as generating NL explanations. However, these models have many shortcomings; they can fabricate explanations even for incorrect predictions, they are difficult to adapt to long input documents, and their training requires a large amount of labeled data. In this paper, we develop FiD-Ex, which addresses these shortcomings for seq2seq models by: 1) introducing sentence markers to eliminate explanation fabrication by encouraging extractive generation, 2) using the fusion-in-decoder architecture to handle long input contexts, and 3) intermediate fine-tuning on re-structured open domain QA datasets to improve few-shot performance. FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy, on multiple tasks from the ERASER explainability benchmark, both in the fully supervised and in the few-shot settings.
△ Less
Submitted 31 December, 2020;
originally announced December 2020.
-
RECEIPT: REfine CoarsE-grained IndePendent Tasks for Parallel Tip decomposition of Bipartite Graphs
Authors:
Kartik Lakhotia,
Rajgopal Kannan,
Viktor Prasanna,
Cesar A. F. De Rose
Abstract:
Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies (2,2-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update(peeling)…
▽ More
Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies (2,2-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update(peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore 2-hop neighborhood renders tip-decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum butterfly vertices makes derived parallel algorithms prone to heavy synchronization.
In this paper, we propose a novel parallel tip-decomposition algorithm -- REfine CoarsE-grained Independent Tasks (RECEIPT) that relaxes the peeling order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and dramatic reduction in synchronizations. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce the amount of wedge exploration and execution time.
We perform detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms -- achieving up to 1100x and 64x reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1x self-relative speedup. Our implementation of RECEIPT is available at https://github.com/kartiklakhotia/RECEIPT.
△ Less
Submitted 16 October, 2020;
originally announced October 2020.
-
SPEC2: SPECtral SParsE CNN Accelerator on FPGAs
Authors:
Yue Niu,
Hanqing Zeng,
Ajitesh Srivastava,
Kartik Lakhotia,
Rajgopal Kannan,
Yanzhi Wang,
Viktor Prasanna
Abstract:
To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into frequency domain significantly reduces the computation complexity of the sliding window operations in space domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense co…
▽ More
To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into frequency domain significantly reduces the computation complexity of the sliding window operations in space domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain high-throughput FPGA implementation, we propose SPEC2 -- the first work to prune and accelerate spectral CNNs. First, we propose a systematic pruning algorithm based on Alternative Direction Method of Multipliers (ADMM). The offline pruning iteratively sets the majority of spectral weights to zero, without using any handcrafted heuristics. Then, we design an optimized pipeline architecture on FPGA that has efficient random access into the sparse kernels and exploits various dimensions of parallelism in convolutional layers. Overall, SPEC2 achieves high inference throughput with extremely low computation complexity and negligible accuracy degradation. We demonstrate SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet, and <1% accuracy loss for VGG16. The resulting accelerators achieve up to 24x higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.
△ Less
Submitted 10 October, 2023; v1 submitted 16 October, 2019;
originally announced October 2019.
-
Approximation Algorithms for Coordinating Ad Campaigns on Social Networks
Authors:
Kartik Lakhotia,
David Kempe
Abstract:
We study a natural model of coordinated social ad campaigns over a social network, based on models of Datta et al. and Aslay et al. Multiple advertisers are willing to pay the host - up to a known budget - per user exposure, whether the exposure is sponsored or orgain (i.e. shared by a friend). Campaigns are seeded with sponsored ads to some users, but no user must be exposed to too many sponsored…
▽ More
We study a natural model of coordinated social ad campaigns over a social network, based on models of Datta et al. and Aslay et al. Multiple advertisers are willing to pay the host - up to a known budget - per user exposure, whether the exposure is sponsored or orgain (i.e. shared by a friend). Campaigns are seeded with sponsored ads to some users, but no user must be exposed to too many sponsored ads. Thus, while ad campaigns proceed independently over the network, they need to be carefully coordinated with respect to their seed sets.
We study the objective of maximizing host's total ad revenue. Our main result is to show that under a broad class of influence models, the problem can be reduced to maximizing a submodular function subject to two matroid constraints; it can therefore be approximated within a factor essentially 1/2 in polynomial time. When there is no bound on the individual seed set sizes of advertisers, the constraints correspond only to a single matroid, and the guarantee can be improved to 1-1/e; in that case, a factor 1/2 is achieved by a practical greedy algorithm. The 1-1/e approximation algorithm for matroid-constrained problem is far from practical; however, we show that specifically under the Independent Cascade model, LP rounding and Reverse Reachability techniques can be combined to obtain a 1-1/e approximation algorithm.
Our theoretical results are complemented by experiments evaluating the extent to which the coordination of multiple ad campaigns inhibits the revenue obtained from each individual campaign, as a function of the similarity of the influence networks and strength of ties in the networks. Our experiments suggest that as networks for different advertisers become less similar, the harmful effect of competition decreases. With respect to tie strengths, we show that the most harm is done in an intermediate range.
△ Less
Submitted 26 December, 2019; v1 submitted 24 August, 2019;
originally announced August 2019.
-
Planting Trees for scalable and efficient Canonical Hub Labeling
Authors:
Kartik Lakhotia,
Qing Dong,
Rajgopal Kannan,
Viktor Prasanna
Abstract:
Point-to-Point Shortest Distance (PPSD) query is a crucial primitive in graph database applications. Hub labeling algorithms compute a labeling that converts a PPSD query into a list intersection problem (over a pre-computed indexing) enabling swift query response. However, constructing hub labeling is computationally challenging. Even state-of-the-art parallel algorithms based on Pruned Landmark…
▽ More
Point-to-Point Shortest Distance (PPSD) query is a crucial primitive in graph database applications. Hub labeling algorithms compute a labeling that converts a PPSD query into a list intersection problem (over a pre-computed indexing) enabling swift query response. However, constructing hub labeling is computationally challenging. Even state-of-the-art parallel algorithms based on Pruned Landmark Labeling (PLL) [3], are plagued by large label size, violation of given network hierarchy, poor scalability and inability to process large graphs.
In this paper, we develop novel parallel shared-memory and distributed-memory algorithms for constructing the Canonical Hub Labeling (CHL) that is minimal in size for a given network hierarchy. To the best of our knowledge, none of the existing parallel algorithms guarantee canonical labeling. Our key contribution, the PLaNT algorithm, scales well beyond the limits of current practice by completely avoiding inter-node communication. PLaNT also enables the design of a collaborative label partitioning scheme across multiple nodes for completely in-memory processing of massive graphs whose labels cannot fit on a single machine.
Compared to the sequential PLL, we empirically demonstrate upto 47.4x speedup on a 72 thread shared-memory platform. On a 64-node cluster, PLaNT achieves an average 42x speedup over single node execution. Finally, we show how our approach demonstrates superior scalability - we can process 14x larger graphs (in terms of label size) and construct hub labeling orders of magnitude faster compared to state-of-the-art distributed paraPLL algorithm.
△ Less
Submitted 28 June, 2019;
originally announced July 2019.
-
PyText: A Seamless Path from NLP research to production
Authors:
Ahmed Aly,
Kushal Lakhotia,
Shicong Zhao,
Mrinal Mohit,
Barlas Oguz,
Abhinav Arora,
Sonal Gupta,
Christopher Dewan,
Stef Nelson-Lindall,
Rushin Shah
Abstract:
We introduce PyText - a deep learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces for model components, and by using PyTorch's capabilities of exporting models for inference via the optimized Caffe2 execution engine.…
▽ More
We introduce PyText - a deep learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces for model components, and by using PyTorch's capabilities of exporting models for inference via the optimized Caffe2 execution engine. We report our own experience of migrating experimentation and production workflows to PyText, which enabled us to iterate faster on novel modeling ideas and then seamlessly ship them at industrial scale.
△ Less
Submitted 12 December, 2018;
originally announced December 2018.
-
GPOP: A cache- and work-efficient framework for Graph Processing Over Partitions
Authors:
Kartik Lakhotia,
Sourav Pati,
Rajgopal Kannan,
Viktor Prasanna
Abstract:
Past decade has seen the development of many shared-memory graph processing frameworks, intended to reduce the effort of developing high performance parallel applications. However many of these frameworks, based on Vertex-centric or Edge-centric paradigms suffer from several issues, such as poor cache utilization, irregular memory accesses, heavy use of synchronization primitives and theoretical i…
▽ More
Past decade has seen the development of many shared-memory graph processing frameworks, intended to reduce the effort of developing high performance parallel applications. However many of these frameworks, based on Vertex-centric or Edge-centric paradigms suffer from several issues, such as poor cache utilization, irregular memory accesses, heavy use of synchronization primitives and theoretical inefficiency, that deteriorate overall performance and scalability.
Recently, we proposed a cache and memory efficient partition-centric paradigm for computing PageRank. In this paper, we generalize this approach to develop a novel Graph Processing Over Partitions (GPOP) framework that is cache-efficient, scalable and work-efficient. GPOP induces locality in memory accesses by increasing granularity of execution to vertex subsets called 'partitions', thereby dramatically improving the cache performance of a variety of graph algorithms. It achieves high scalability by enabling completely lock and atomic free computation. GPOP's built-in analytical performance model enables it to use a hybrid of source and partitioncentric communication modes in a way that ensures work-efficiency each iteration, while simultaneously boosting high bandwidth sequential memory accesses.
We extensively evaluate the performance of GPOP for a variety of graph algorithms, using several large datasets. We observe that GPOP incurs up to 9x, 6.8x and 5.5x less L2 cache misses compared to Ligra, GraphMat and Galois, respectively. In terms of execution time, GPOP is upto 19x, 9.3x and 3.6x faster than Ligra, GraphMat and Galois respectively.
△ Less
Submitted 19 November, 2019; v1 submitted 21 June, 2018;
originally announced June 2018.
-
Accelerating PageRank using Partition-Centric Processing
Authors:
Kartik Lakhotia,
Rajgopal Kannan,
Viktor Prasanna
Abstract:
PageRank is a fundamental link analysis algorithm that also functions as a key representative of the performance of Sparse Matrix-Vector (SpMV) multiplication. The traditional PageRank implementation generates fine granularity random memory accesses resulting in large amount of wasteful DRAM traffic and poor bandwidth utilization. In this paper, we present a novel Partition-Centric Processing Meth…
▽ More
PageRank is a fundamental link analysis algorithm that also functions as a key representative of the performance of Sparse Matrix-Vector (SpMV) multiplication. The traditional PageRank implementation generates fine granularity random memory accesses resulting in large amount of wasteful DRAM traffic and poor bandwidth utilization. In this paper, we present a novel Partition-Centric Processing Methodology (PCPM) to compute PageRank, that drastically reduces the amount of DRAM communication while achieving high sustained memory bandwidth. PCPM uses a Partition-centric abstraction coupled with the Gather-Apply-Scatter (GAS) programming model. By carefully examining how a PCPM based implementation impacts communication characteristics of the algorithm, we propose several system optimizations that improve the execution time substantially. More specifically, we develop (1) a new data layout that significantly reduces communication and random DRAM accesses, and (2) branch avoidance mechanisms to get rid of unpredictable data-dependent branches.
We perform detailed analytical and experimental evaluation of our approach using 6 large graphs and demonstrate an average 2.7x speedup in execution time and 1.7x reduction in communication volume, compared to the state-of-the-art. We also show that unlike other GAS based implementations, PCPM is able to further reduce main memory traffic by taking advantage of intelligent node labeling that enhances locality. Although we use PageRank as the target application in this paper, our approach can be applied to generic SpMV computation.
△ Less
Submitted 6 August, 2018; v1 submitted 20 September, 2017;
originally announced September 2017.