-
Programmable Synthetic Magnetism and Chiral Edge States in Nano-Optomechanical Quantum Hall Networks
Authors:
Jesse J. Slim,
Javier del Pino,
Ewold Verhagen
Abstract:
Artificial magnetic fields break time-reversal symmetry in engineered materials (also known as metamaterials), enabling robust, topological transport of neutral excitations, much like electronic conduction edge channels in the integer quantum Hall effect. We experimentally demonstrate the emergence of quantum-Hall-like chiral edge states in optomechanical resonator networks. Synthetic magnetic fields for phononic excitations are induced through laser drives, while cavity optomechanical control allows full reconfigurability of the effective metamaterial response of the networks, including programming of magnetic fluxes in multiple resonator plaquettes. By tuning the interplay between network connectivity and magnetic fields, we demonstrate both flux-sensitive and flux-insensitive localized mechanical states. Scaling up the system creates spectral features that are precursors to Hofstadter butterfly spectra. Site-resolved spectroscopy reveals edge-bulk separation, with stationary phononic distributions signaling chiral edge modes. We directly probe those edge modes in transport measurements to demonstrate a unidirectional acoustic channel. This work unlocks new ways of controlling topological phononic phases at the nanoscale with applications in noise management and information processing.
Submitted 30 January, 2025;
originally announced January 2025.
-
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Authors:
Yejin Lee,
Anna Sun,
Basil Hosmer,
Bilge Acun,
Can Balioglu,
Changhan Wang,
Charles David Hernandez,
Christian Puhrsch,
Daniel Haziza,
Driss Guessous,
Francisco Massa,
Jacob Kahn,
Jeffrey Wan,
Jeremy Reizenstein,
Jiaqi Zhai,
Joe Isaacson,
Joel Schlosser,
Juan Pino,
Kaushik Ram Sadagopan,
Leonid Shamis,
Linjian Ma,
Min-Jae Hwang,
Mingda Chen,
Mostafa Elhoushi,
Pedro Rodriguez
, et al. (5 additional authors not shown)
Abstract:
Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only have its applications broadened to various sectors, but it also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.
Submitted 9 May, 2025; v1 submitted 30 September, 2024;
originally announced October 2024.
-
Fluctuation instabilities via internal resonance in a multimode membrane as a mechanism for frequency combs
Authors:
Mengqi Fu,
Orjan Ameye,
Fan Yang,
Jan Košata,
Javier del Pino,
Oded Zilberberg,
Elke Scheer
Abstract:
We explore self-induced parametric coupling, also called internal resonances (IRs), in a membrane nanoelectromechanical system. Specifically, we focus on the formation of a limit cycle manifesting as a phononic frequency comb. Utilizing a pump-noisy-probe technique and theoretical modeling, we characterize the behavior of mechanical excitations that appear as sidebands of the stationary IR response. We find that when the energy-absorbing excitation of a lower mode is parametrically upconverted to hybridize with a higher mode, significant squeezing and bimodality in the upper mode occur. Instead, when the upconverted absorbing excitation hybridizes with an emitting sideband of the higher mode, a Hopf bifurcation occurs and a limit cycle forms, manifesting as a frequency comb. We thus reveal a unique mechanism to obtain frequency combs in parametrically coupled modes. We furthermore demonstrate a rich variety of IR effects, the origin of which significantly extends beyond standard linear parametric coupling phenomena. Our findings enhance the understanding of energy transfer mechanisms with implications for advanced sensing technologies and novel phononic metamaterials.
Submitted 23 September, 2024;
originally announced September 2024.
-
Slow and fast topological dynamical phase transitions in a Duffing resonator driven by two detuned tones
Authors:
Letizia Catalini,
Javier del Pino,
Soumya S. Kumar,
Vincent Dumont,
Gabriel Margiani,
Oded Zilberberg,
Alexander Eichler
Abstract:
The combination of a strong pump and a weak probe has been widely applied to investigate both optical and nanomechanical devices. Such pump-probe measurements allow for the exploration of nonlinear dynamics, driven by the large pump tone, by measuring the system response to a probe tone. In contrast, here we report on the dynamics of a mechanical Duffing resonator driven with a combination of two large tones at different frequencies. Our results indicate the presence of various distinct regimes with very different dynamics. We systematically investigate the impact of the relative strength and detuning between the two drives on the dynamical response. This provides an illustrative example of dynamical phase transitions in out-of-equilibrium systems.
Submitted 17 December, 2024; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Topological classification of driven-dissipative nonlinear systems
Authors:
Greta Villa,
Javier del Pino,
Vincent Dumont,
Gianluca Rastelli,
Mateusz Michałek,
Alexander Eichler,
Oded Zilberberg
Abstract:
In topology, one averages over local geometrical details to reveal robust global features. This approach proves crucial for understanding quantized bulk transport and exotic boundary effects of linear wave propagation in (meta-)materials. Moving beyond linear Hamiltonian systems, the study of topology in physics strives to characterize open (non-Hermitian) and interacting systems. Here, we establish a framework for the topological classification of driven-dissipative nonlinear systems. Specifically, we define a graph index for the Floquet semiclassical equations of motion describing such systems. The graph index builds upon topological vector analysis theory and combines knowledge of the particle-hole nature of fluctuations around each out-of-equilibrium stationary state. To test this approach, we divulge the topological invariants arising in a micro-electromechanical nonlinear resonator subject to forcing and a time-modulated potential. Our framework classifies the complete phase diagram of the system and reveals the topological origin of driven-dissipative phase transitions, as well as that of under- to over-damped responses. Furthermore, we predict topological phase transitions between symmetry-broken phases that pertain to population inversion transitions. This rich manifesting phenomenology reveals the pervasive link between topology and nonlinear dynamics, with broad implications for all fields of science.
Submitted 24 June, 2024;
originally announced June 2024.
-
The computational power of random quantum circuits in arbitrary geometries
Authors:
Matthew DeCross,
Reza Haghshenas,
Minzhao Liu,
Enrico Rinaldi,
Johnnie Gray,
Yuri Alexeev,
Charles H. Baldwin,
John P. Bartolotta,
Matthew Bohn,
Eli Chertkov,
Julia Cline,
Jonhas Colina,
Davide DelVento,
Joan M. Dreiling,
Cameron Foltz,
John P. Gaebler,
Thomas M. Gatterman,
Christopher N. Gilbreth,
Joshua Giles,
Dan Gresh,
Alex Hall,
Aaron Hankin,
Azure Hansen,
Nathan Hewitt,
Ian Hoffman
, et al. (27 additional authors not shown)
Abstract:
Empirical evidence for a gap between the computational powers of classical and quantum computers has been provided by experiments that sample the output distributions of two-dimensional quantum circuits. Many attempts to close this gap have utilized classical simulations based on tensor network techniques, and their limitations shed light on the improvements to quantum hardware required to frustrate classical simulability. In particular, quantum computers having in excess of $\sim 50$ qubits are primarily vulnerable to classical simulation due to restrictions on their gate fidelity and their connectivity, the latter determining how many gates are required (and therefore how much infidelity is suffered) in generating highly-entangled states. Here, we describe recent hardware upgrades to Quantinuum's H2 quantum computer enabling it to operate on up to $56$ qubits with arbitrary connectivity and $99.843(5)\%$ two-qubit gate fidelity. Utilizing the flexible connectivity of H2, we present data from random circuit sampling in highly connected geometries, doing so at unprecedented fidelities and a scale that appears to be beyond the capabilities of state-of-the-art classical algorithms. The considerable difficulty of classically simulating H2 is likely limited only by qubit number, demonstrating the promise and scalability of the QCCD architecture as continued progress is made towards building larger machines.
Submitted 21 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Residual-based Attention Physics-informed Neural Networks for Spatio-Temporal Ageing Assessment of Transformers Operated in Renewable Power Plants
Authors:
Ibai Ramirez,
Joel Pino,
David Pardo,
Mikel Sanz,
Luis del Rio,
Alvaro Ortiz,
Kateryna Morozovska,
Jose I. Aizpurua
Abstract:
Transformers are crucial for reliable and efficient power system operations, particularly in supporting the integration of renewable energy. Effective monitoring of transformer health is critical to maintain grid stability and performance. Thermal insulation ageing is a key transformer failure mode, which is generally tracked by monitoring the hotspot temperature (HST). However, HST measurement is complex, costly, and often estimated from indirect measurements. Existing HST models focus on space-agnostic thermal models, providing worst-case HST estimates. This article introduces a spatio-temporal model for transformer winding temperature and ageing estimation, which leverages physics-based partial differential equations (PDEs) with data-driven Neural Networks (NN) in a Physics Informed Neural Networks (PINNs) configuration to improve prediction accuracy and acquire spatio-temporal resolution. The computational accuracy of the PINN model is improved through the implementation of the Residual-Based Attention (PINN-RBA) scheme that accelerates the PINN model convergence. The PINN-RBA model is benchmarked against self-adaptive attention schemes and classical vanilla PINN configurations. For the first time, PINN based oil temperature predictions are used to estimate spatio-temporal transformer winding temperature values, validated through PDE numerical solution and fiber optic sensor measurements. Furthermore, the spatio-temporal transformer ageing model is inferred, which supports transformer health management decision-making. Results are validated with a distribution transformer operating on a floating photovoltaic power plant.
Submitted 3 October, 2024; v1 submitted 10 May, 2024;
originally announced May 2024.
-
High-fidelity and Fault-tolerant Teleportation of a Logical Qubit using Transversal Gates and Lattice Surgery on a Trapped-ion Quantum Computer
Authors:
C. Ryan-Anderson,
N. C. Brown,
C. H. Baldwin,
J. M. Dreiling,
C. Foltz,
J. P. Gaebler,
T. M. Gatterman,
N. Hewitt,
C. Holliman,
C. V. Horst,
J. Johansen,
D. Lucchetti,
T. Mengle,
M. Matheny,
Y. Matsuoka,
K. Mayer,
M. Mills,
S. A. Moses,
B. Neyenhuis,
J. Pino,
P. Siegfried,
R. P. Stutz,
J. Walker,
D. Hayes
Abstract:
Quantum state teleportation is commonly used in designs for large-scale fault-tolerant quantum computers. Using Quantinuum's H2 trapped-ion quantum processor, we implement the first demonstration of a fault-tolerant state teleportation circuit for a quantum error correction code - in particular, the planar topological [[7,1,3]] color code, or Steane code. The circuits use up to 30 trapped ions at the physical layer qubits and employ real-time quantum error correction - decoding mid-circuit measurement of syndromes and implementing corrections during the protocol. We conduct experiments on several variations of logical teleportation circuits using both transversal gates and lattice surgery protocols. Among the many measurements we report on, we measure the logical process fidelity of the transversal teleportation circuit to be 0.975(2) and the logical process fidelity of the lattice surgery teleportation circuit to be 0.851(9). Additionally, we run a teleportation circuit that is equivalent to Knill-style quantum error correction and measure the process fidelity to be 0.989(2).
Submitted 25 April, 2024;
originally announced April 2024.
-
Benchmarking logical three-qubit quantum Fourier transform encoded in the Steane code on a trapped-ion quantum computer
Authors:
Karl Mayer,
Ciarán Ryan-Anderson,
Natalie Brown,
Elijah Durso-Sabina,
Charles H. Baldwin,
David Hayes,
Joan M. Dreiling,
Cameron Foltz,
John P. Gaebler,
Thomas M. Gatterman,
Justin A. Gerber,
Kevin Gilmore,
Dan Gresh,
Nathan Hewitt,
Chandler V. Horst,
Jacob Johansen,
Tanner Mengle,
Michael Mills,
Steven A. Moses,
Peter E. Siegfried,
Brian Neyenhuis,
Juan Pino,
Russell Stutz
Abstract:
We implement logically encoded three-qubit circuits for the quantum Fourier transform (QFT), using the [[7,1,3]] Steane code, and benchmark the circuits on the Quantinuum H2-1 trapped-ion quantum computer. The circuits require multiple logical two-qubit gates, which are implemented transversally, as well as logical non-Clifford single-qubit rotations, which are performed by non-fault-tolerant state preparation followed by a teleportation gadget. First, we benchmark individual logical components using randomized benchmarking for the logical two-qubit gate, and a Ramsey-type experiment for the logical $T$ gate. We then implement the full QFT circuit, using two different methods for performing a logical control-$T$, and benchmark the circuits by applying them to each basis state in a set of bases that is sufficient to lower bound the process fidelity. We compare the logical QFT benchmark results to predictions based on the logical component benchmarks.
Submitted 12 April, 2024;
originally announced April 2024.
-
Demonstration of logical qubits and repeated error correction with better-than-physical error rates
Authors:
A. Paetznick,
M. P. da Silva,
C. Ryan-Anderson,
J. M. Bello-Rivas,
J. P. Campora III,
A. Chernoguzov,
J. M. Dreiling,
C. Foltz,
F. Frachon,
J. P. Gaebler,
T. M. Gatterman,
L. Grans-Samuelsson,
D. Gresh,
D. Hayes,
N. Hewitt,
C. Holliman,
C. V. Horst,
J. Johansen,
D. Lucchetti,
Y. Matsuoka,
M. Mills,
S. A. Moses,
B. Neyenhuis,
A. Paz,
J. Pino
, et al. (7 additional authors not shown)
Abstract:
The promise of quantum computers hinges on the ability to scale to large system sizes, e.g., to run quantum computations consisting of more than 100 million operations fault-tolerantly. This in turn requires suppressing errors to levels inversely proportional to the size of the computation. As a step towards this ambitious goal, we present experiments on a trapped-ion QCCD processor where, through the use of fault-tolerant encoding and error correction, we are able to suppress logical error rates to levels below the physical error rates. In particular, we entangled logical qubits encoded in the [[7,1,3]] code with error rates 9.8 times to 500 times lower than at the physical level, and entangled logical qubits encoded in a [[12,2,4]] code based on Knill's C4/C6 scheme with error rates 4.7 times to 800 times lower than at the physical level, depending on the judicious use of post-selection. Moreover, we demonstrate repeated error correction with the [[12,2,4]] code, with logical error rates below physical circuit baselines corresponding to repeated CNOTs, and show evidence that the error rate per error correction cycle, which consists of over 100 physical CNOTs, approaches the error rate of two physical CNOTs. These results signify a transition from noisy intermediate scale quantum computing to reliable quantum computing, and demonstrate advanced capabilities toward large-scale fault-tolerant quantum computing.
Submitted 17 November, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Authors:
HyoJung Han,
Mohamed Anwar,
Juan Pino,
Wei-Ning Hsu,
Marine Carpuat,
Bowen Shi,
Changhan Wang
Abstract:
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
Submitted 12 August, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Spirit LM: Interleaved Spoken and Written Language Model
Authors:
Tu Anh Nguyen,
Benjamin Muller,
Bokai Yu,
Marta R. Costa-jussa,
Maha Elbayad,
Sravya Popuri,
Christophe Ropers,
Paul-Ambroise Duquenne,
Robin Algayres,
Ruslan Mavlyutov,
Itai Gat,
Mary Williamson,
Gabriel Synnaeve,
Juan Pino,
Benoit Sagot,
Emmanuel Dupoux
Abstract:
We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. Spirit LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that Spirit LM can learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification). We make available model weights and inference code.
Submitted 18 October, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Seamless: Multilingual Expressive and Streaming Speech Translation
Authors:
Seamless Communication,
Loïc Barrault,
Yu-An Chung,
Mariano Coria Meglioli,
David Dale,
Ning Dong,
Mark Duppenthaler,
Paul-Ambroise Duquenne,
Brian Ellis,
Hady Elsahar,
Justin Haaheim,
John Hoffman,
Min-Jae Hwang,
Hirofumi Inaguma,
Christopher Klaiber,
Ilia Kulikov,
Pengwei Li,
Daniel Licht,
Jean Maillard,
Ruslan Mavlyutov,
Alice Rakotoarison,
Kaushik Ram Sadagopan,
Abinesh Ramakrishnan,
Tuan Tran,
Guillaume Wenzek
, et al. (40 additional authors not shown)
Abstract:
Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication
Submitted 8 December, 2023;
originally announced December 2023.
-
Near-resonant nuclear spin detection with high-frequency mechanical resonators
Authors:
Diego A. Visani,
Letizia Catalini,
Christian L. Degen,
Alexander Eichler,
Javier del Pino
Abstract:
Mechanical resonators operating in the high-frequency regime have become a versatile platform for fundamental and applied quantum research. Their exceptional properties, such as low mass and high quality factor, make them also very appealing for force sensing experiments. In this Letter, we propose a method for detecting and ultimately controlling nuclear spins by directly coupling them to high-frequency resonators via a magnetic field gradient. Dynamical backaction between the sensor and an ensemble of nuclear spins produces a shift in the sensor's resonance frequency, which can be measured to probe the spin ensemble. Based on analytical as well as numerical results, we predict that the method will allow nanoscale magnetic resonance imaging with a range of realistic devices. At the same time, this interaction paves the way for new manipulation techniques, similar to those employed in cavity optomechanics, enriching both the sensor's and the spin ensemble's features.
Submitted 27 November, 2023;
originally announced November 2023.
-
Measuring the Loschmidt amplitude for finite-energy properties of the Fermi-Hubbard model on an ion-trap quantum computer
Authors:
Kévin Hémery,
Khaldoon Ghanem,
Eleanor Crane,
Sara L. Campbell,
Joan M. Dreiling,
Caroline Figgatt,
Cameron Foltz,
John P. Gaebler,
Jacob Johansen,
Michael Mills,
Steven A. Moses,
Juan M. Pino,
Anthony Ransford,
Mary Rowe,
Peter Siegfried,
Russell P. Stutz,
Henrik Dreyer,
Alexander Schuckert,
Ramil Nigmatullin
Abstract:
Calculating the equilibrium properties of condensed matter systems is one of the promising applications of near-term quantum computing. Recently, hybrid quantum-classical time-series algorithms have been proposed to efficiently extract these properties from a measurement of the Loschmidt amplitude $\langle ψ| e^{-i \hat H t}|ψ\rangle$ from initial states $|ψ\rangle$ and a time evolution under the Hamiltonian $\hat H$ up to short times $t$. In this work, we study the operation of this algorithm on a present-day quantum computer. Specifically, we measure the Loschmidt amplitude for the Fermi-Hubbard model on a $16$-site ladder geometry (32 orbitals) on the Quantinuum H2-1 trapped-ion device. We assess the effect of noise on the Loschmidt amplitude and implement algorithm-specific error mitigation techniques. By using a thus-motivated error model, we numerically analyze the influence of noise on the full operation of the quantum-classical algorithm by measuring expectation values of local observables at finite energies. Finally, we estimate the resources needed for scaling up the algorithm.
Submitted 22 September, 2023; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Optomechanical realization of the bosonic Kitaev-Majorana chain
Authors:
Jesse J. Slim,
Clara C. Wanjura,
Matteo Brunelli,
Javier del Pino,
Andreas Nunnenkamp,
Ewold Verhagen
Abstract:
The fermionic Kitaev chain is a canonical model featuring topological Majorana zero modes. We report the experimental realization of its bosonic analogue in a nano-optomechanical network where parametric interactions induce two-mode squeezing and beamsplitter coupling among the nanomechanical modes, equivalent to hopping and superconductor pairing in the fermionic case, respectively. We observe several extraordinary phenomena in the bosonic dynamics and transport, including quadrature-dependent chiral amplification, exponential scaling of the gain with system size, and strong sensitivity to boundary conditions. Controlling the interaction phases and amplitudes uncovers a rich dynamical phase diagram that links the observed phenomena to non-Hermitian topological phase transitions. Finally, we present an experimental demonstration of an exponentially enhanced response to a small perturbation as a consequence of non-Hermitian topology. These results represent the demonstration of a novel synthetic phase of matter whose bosonic dynamics do not have fermionic parallels, and establish a powerful system to study non-Hermitian topology and its applications in signal manipulation and sensing.
Submitted 11 September, 2023;
originally announced September 2023.
-
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Authors:
Seamless Communication,
Loïc Barrault,
Yu-An Chung,
Mariano Cora Meglioli,
David Dale,
Ning Dong,
Paul-Ambroise Duquenne,
Hady Elsahar,
Hongyu Gong,
Kevin Heffernan,
John Hoffman,
Christopher Klaiber,
Pengwei Li,
Daniel Licht,
Jean Maillard,
Alice Rakotoarison,
Kaushik Ram Sadagopan,
Guillaume Wenzek,
Ethan Ye,
Bapi Akula,
Peng-Jen Chen,
Naji El Hachem,
Brian Ellis,
Gabriel Mejia Gonzalez,
Justin Haaheim, et al. (43 additional authors not shown)
Abstract:
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
Submitted 24 October, 2023; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Limit cycles as stationary states of an extended Harmonic Balance ansatz
Authors:
Javier del Pino,
Jan Košata,
Oded Zilberberg
Abstract:
A limit cycle is a self-sustained periodic motion appearing in autonomous ordinary differential equations. As the period of a limit cycle is a priori unknown, it is challenging to find limit cycles as stationary states of a rotating ansatz. Correspondingly, their study commonly relies on brute-force time-evolution or on circumstantial evidence such as instabilities of fixed points. Alas, such approaches are unable to account for the coexistence of multiple solutions, as they rely on specific initial conditions. Here, we develop a multifrequency rotating ansatz with which we find limit cycles as stationary states. We demonstrate our approach and its performance in the simplest case of the Van der Pol oscillator. Moving beyond the simplest example, we show that our method can capture the coexistence of all fixed-point attractors and limit cycles in a modified nonlinear Van der Pol oscillator. Our results facilitate the systematic mapping of out-of-equilibrium phase diagrams, with implications across all fields of natural science.
Submitted 11 August, 2023;
originally announced August 2023.
-
A biased Ising model using two coupled Kerr parametric oscillators with external force
Authors:
Pablo Álvarez,
Davide Pittilini,
Filippo Miserocchi,
Sathyanarayanan Raamamurthy,
Gabriel Margiani,
Orjan Ameye,
Javier del Pino,
Oded Zilberberg,
Alexander Eichler
Abstract:
Networks of coupled Kerr parametric oscillators (KPOs) are a leading physical platform for analog solving of complex optimization problems. These systems are colloquially known as ``Ising machines''. We experimentally and theoretically study such a network under the influence of an external force. The force breaks the collective phase-parity symmetry of the system and competes with the intrinsic coupling in ordering the network configuration, similar to how a magnetic field biases an interacting spin ensemble. Specifically, we demonstrate how the force can be used to control the system, and highlight the crucial role of the phase and symmetry of the force. Our work thereby provides a method to create Ising machines with arbitrary bias, extending even to exotic cases that are impossible to engineer in real spin systems.
Submitted 25 July, 2023;
originally announced July 2023.
-
Multilingual Speech-to-Speech Translation into Multiple Target Languages
Authors:
Hongyu Gong,
Ning Dong,
Sravya Popuri,
Vedanuj Goswami,
Ann Lee,
Juan Pino
Abstract:
Speech-to-speech translation (S2ST) enables spoken communication between people talking in different languages. The few existing studies on multilingual S2ST focus on multilinguality on the source side, i.e., translation from multiple source languages to one target language. We present the first work on multilingual S2ST supporting multiple target languages. Leveraging recent advances in direct S2ST with speech-to-unit (S2U) and vocoder, we equip these key components with multilingual capability. Speech-to-masked-unit (S2MU) is the multilingual extension of S2U, which masks units that do not belong to the given target language to reduce language interference. We also propose a multilingual vocoder trained with language embedding and an auxiliary language-identification loss. On benchmark translation test sets, our proposed multilingual model outperforms bilingual models in translation from English into $16$ target languages.
Submitted 17 July, 2023;
originally announced July 2023.
-
Khovanskii bases for semimixed systems of polynomial equations -- a case of approximating stationary nonlinear Newtonian dynamics
Authors:
Viktoriia Borovik,
Paul Breiding,
Javier del Pino,
Mateusz Michałek,
Oded Zilberberg
Abstract:
We provide an approach to counting roots of polynomial systems, where each polynomial is a general linear combination of prescribed, fixed polynomials. Our tools rely on the theory of Khovanskii bases, combined with toric geometry, the Bernstein-Khovanskii-Kushnirenko (BKK) Theorem, and fiber products.
As a direct application of this theory, we solve the problem of counting the number of approximate stationary states for coupled driven nonlinear resonators. We set up a system of polynomial equations that depends on three numbers $N, n$ and $M$ and whose solutions model the stationary states. The parameter $N$ is the number of coupled resonators, $2n - 1$ is the degree of nonlinearity of the underlying differential equation, and $M$ is the number of frequencies used in the approximation. We use our main theorems, that is, the generalized BKK Theorem and the Decoupling Theorem, to count the number of (complex) solutions of the polynomial system for an arbitrary degree of nonlinearity $2n - 1 \geq 3$, any number of resonators $N \geq 1$, and $M = 1$ harmonic. We also solve the case $N = 1, n = 2$ and $M = 2$ and give a computational way to check the number of solutions for $N = 1, n = 2$ and $M \geq 2$. This extends the results of arXiv:2208.08179.
Submitted 13 June, 2023;
originally announced June 2023.
-
Exploration on HuBERT with Multiple Resolutions
Authors:
Jiatong Shi,
Yun Tang,
Hirofumi Inaguma,
Hongyu Gong,
Juan Pino,
Shinji Watanabe
Abstract:
Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20ms resolution for hidden representations would not be optimal for various speech-processing tasks since their attributes (e.g., speaker characteristics and semantics) are based on different time scales. To address this limitation, we propose utilizing HuBERT representations at multiple resolutions for downstream tasks. We explore two approaches, namely the parallel and hierarchical approaches, for integrating HuBERT features with different resolutions. Through experiments, we demonstrate that HuBERT with multiple resolutions outperforms the original model. This highlights the potential of utilizing multiple resolutions in SSL models like HuBERT to capture diverse information from speech signals.
Submitted 22 June, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
A Race Track Trapped-Ion Quantum Processor
Authors:
S. A. Moses,
C. H. Baldwin,
M. S. Allman,
R. Ancona,
L. Ascarrunz,
C. Barnes,
J. Bartolotta,
B. Bjork,
P. Blanchard,
M. Bohn,
J. G. Bohnet,
N. C. Brown,
N. Q. Burdick,
W. C. Burton,
S. L. Campbell,
J. P. Campora III,
C. Carron,
J. Chambers,
J. W. Chan,
Y. H. Chen,
A. Chernoguzov,
E. Chertkov,
J. Colina,
J. P. Curtis,
R. Daniel, et al. (71 additional authors not shown)
Abstract:
We describe and benchmark a new quantum charge-coupled device (QCCD) trapped-ion quantum computer based on a linear trap with periodic boundary conditions, which resembles a race track. The new system successfully incorporates several technologies crucial to future scalability, including electrode broadcasting, multi-layer RF routing, and magneto-optical trap (MOT) loading, while maintaining, and in some cases exceeding, the gate fidelities of previous QCCD systems. The system is initially operated with 32 qubits, but future upgrades will allow for more. We benchmark the performance of primitive operations, including an average state preparation and measurement error of $1.6(1)\times 10^{-3}$, an average single-qubit gate infidelity of $2.5(3)\times 10^{-5}$, and an average two-qubit gate infidelity of $1.84(5)\times 10^{-3}$. The system-level performance of the quantum processor is assessed with mirror benchmarking, linear cross-entropy benchmarking, a quantum volume measurement of $\mathrm{QV}=2^{16}$, and the creation of 32-qubit entanglement in a GHZ state. We also tested application benchmarks including Hamiltonian simulation, QAOA, error correction on a repetition code, and dynamics simulations using qubit reuse. We also discuss future upgrades to the new system aimed at adding more qubits and capabilities.
Submitted 16 May, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Non-Abelian Topological Order and Anyons on a Trapped-Ion Processor
Authors:
Mohsin Iqbal,
Nathanan Tantivasadakarn,
Ruben Verresen,
Sara L. Campbell,
Joan M. Dreiling,
Caroline Figgatt,
John P. Gaebler,
Jacob Johansen,
Michael Mills,
Steven A. Moses,
Juan M. Pino,
Anthony Ransford,
Mary Rowe,
Peter Siegfried,
Russell P. Stutz,
Michael Foss-Feig,
Ashvin Vishwanath,
Henrik Dreyer
Abstract:
Non-Abelian topological order (TO) is a coveted state of matter with remarkable properties, including quasiparticles that can remember the sequence in which they are exchanged. These anyonic excitations are promising building blocks of fault-tolerant quantum computers. However, despite extensive efforts, non-Abelian TO and its excitations have remained elusive, unlike the simpler quasiparticles or defects in Abelian TO. In this work, we present the first unambiguous realization of non-Abelian TO and demonstrate control of its anyons. Using an adaptive circuit on Quantinuum's H2 trapped-ion quantum processor, we create the ground state wavefunction of $D_4$ TO on a kagome lattice of 27 qubits, with fidelity per site exceeding $98.4\%$. By creating and moving anyons along Borromean rings in spacetime, anyon interferometry detects an intrinsically non-Abelian braiding process. Furthermore, tunneling non-Abelions around a torus creates all 22 ground states, as well as an excited state with a single anyon -- a peculiar feature of non-Abelian TO. This work illustrates the counterintuitive nature of non-Abelions and enables their study in quantum devices.
Submitted 14 February, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
Authors:
Yun Tang,
Anna Y. Sun,
Hirofumi Inaguma,
Xinyue Chen,
Ning Dong,
Xutai Ma,
Paden D. Tomasello,
Juan Pino
Abstract:
Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence to sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder in the AED model, and the outputs of the decoder are conditioned on the speech inputs instead of outputs from an unconditioned language model. The proposed solution ensures that the model is optimized by covering all possible read/write scenarios and creates a matched environment for streaming applications. We evaluate the proposed approach on the MuST-C dataset and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks. In the streaming case, TAED outperforms Transducer in the ASR task and one ST direction while comparable results are achieved in another translation direction.
Submitted 4 May, 2023;
originally announced May 2023.
-
Enhancing Speech-to-Speech Translation with Multiple TTS Targets
Authors:
Jiatong Shi,
Yun Tang,
Ann Lee,
Hirofumi Inaguma,
Changhan Wang,
Juan Pino,
Shinji Watanabe
Abstract:
Direct speech-to-speech translation (S2ST) models are known to suffer from data scarcity because of the limited existing parallel materials for both source and target speech. Therefore, to train a direct S2ST system, previous works usually utilize text-to-speech (TTS) systems to generate samples in the target language by augmenting the data from speech-to-text translation (S2TT). However, there has been limited investigation into how the synthesized target speech affects S2ST models. In this work, we analyze the effect of changing synthesized target speech for direct S2ST models. We find that simply combining the target speech from different TTS systems can potentially improve S2ST performance. Following that, we also propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems. Extensive experiments demonstrate that our proposed framework achieves consistent improvements (2.8 BLEU) over the baselines on the Fisher Spanish-English dataset.
Submitted 10 April, 2023;
originally announced April 2023.
-
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
Authors:
Brian Yan,
Jiatong Shi,
Yun Tang,
Hirofumi Inaguma,
Yifan Peng,
Siddharth Dalmia,
Peter Polák,
Patrick Fernandes,
Dan Berrebbi,
Tomoki Hayashi,
Xiaohui Zhang,
Zhaoheng Ni,
Moto Hira,
Soumi Maiti,
Juan Pino,
Shinji Watanabe
Abstract:
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.
Submitted 6 July, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Authors:
Mohamed Anwar,
Bowen Shi,
Vedanuj Goswami,
Wei-Ning Hsu,
Juan Pino,
Changhan Wang
Abstract:
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.
Submitted 7 March, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Pre-training for Speech Translation: CTC Meets Optimal Transport
Authors:
Phuong-Hang Le,
Hongyu Gong,
Changhan Wang,
Juan Pino,
Benjamin Lecouteux,
Didier Schwab
Abstract:
The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models. Code and pre-trained models are available at https://github.com/formiel/fairseq.
Submitted 5 June, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Authors:
Hirofumi Inaguma,
Sravya Popuri,
Ilia Kulikov,
Peng-Jen Chen,
Changhan Wang,
Yu-An Chung,
Yun Tang,
Ann Lee,
Shinji Watanabe,
Juan Pino
Abstract:
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
Submitted 26 May, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Dynamical gauge fields with bosonic codes
Authors:
Javier del Pino,
Oded Zilberberg
Abstract:
The quantum simulation of dynamical gauge field theories offers the opportunity to study complex high-energy physics with controllable low-energy devices. For quantum computation, bosonic codes promise robust error correction that exploits multi-particle redundancy in bosons. Here, we demonstrate how bosonic codes can be used to simulate dynamical gauge fields. We encode both matter and dynamical gauge fields in a network of resonators that are coupled via three-wave-mixing. The mapping to a $\mathbb{Z}_2$ dynamical lattice gauge theory is established when the gauge resonators operate as Schrödinger cat states. We explore the optimal conditions under which the system preserves the required gauge symmetries. Our findings promote realising high-energy models using bosonic codes.
Submitted 7 February, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Speech-to-Speech Translation For A Real-world Unwritten Language
Authors:
Peng-Jen Chen,
Kevin Tran,
Yilin Yang,
Jingfei Du,
Justine Kao,
Yu-An Chung,
Paden Tomasello,
Paul-Ambroise Duquenne,
Holger Schwenk,
Hongyu Gong,
Hirofumi Inaguma,
Sravya Popuri,
Changhan Wang,
Juan Pino,
Wei-Ning Hsu,
Ann Lee
Abstract:
We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focus on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection and modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling side, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation.
Submitted 11 November, 2022;
originally announced November 2022.
-
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
Authors:
Paul-Ambroise Duquenne,
Hongyu Gong,
Ning Dong,
Jingfei Du,
Ann Lee,
Vedanuj Goswami,
Changhan Wang,
Juan Pino,
Benoît Sagot,
Holger Schwenk
Abstract:
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
Submitted 8 November, 2022;
originally announced November 2022.
-
Deterministic and stochastic sampling of two coupled Kerr parametric oscillators
Authors:
Gabriel Margiani,
Javier del Pino,
Toni L. Heugel,
Nicholas E. Bousse,
Sebastián Guerrero,
Thomas W. Kenny,
Oded Zilberberg,
Deividas Sabonis,
Alexander Eichler
Abstract:
The vision of building computational hardware for problem optimization has spurred large efforts in the physics community. In particular, networks of Kerr parametric oscillators (KPOs) are envisioned as simulators for finding the ground states of Ising Hamiltonians. It was shown, however, that KPO networks can feature large numbers of unexpected solutions that are difficult to sample with the existing deterministic (i.e., adiabatic) protocols. In this work, we experimentally realize a system of two classical coupled KPOs, and we find good agreement with the predicted mapping to Ising states. We then introduce a protocol based on stochastic sampling of the system, and we show how the resulting probability distribution can be used to identify the ground state of the corresponding Ising Hamiltonian. This method is akin to a Monte Carlo sampling of multiple out-of-equilibrium stationary states and is less prone to become trapped in local minima than deterministic protocols.
Submitted 3 March, 2023; v1 submitted 26 October, 2022;
originally announced October 2022.
-
Simple and Effective Unsupervised Speech Translation
Authors:
Changhan Wang,
Hirofumi Inaguma,
Peng-Jen Chen,
Ilia Kulikov,
Yun Tang,
Wei-Ning Hsu,
Michael Auli,
Juan Pino
Abstract:
The amount of labeled data to train models for speech tasks is limited for most languages; however, the data scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation and speech synthesis, either in a pipeline approach, or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially for low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark. On CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on MuST-C and CVSS benchmarks.
Submitted 18 October, 2022;
originally announced October 2022.
-
Quadrature nonreciprocity: unidirectional bosonic transmission without breaking time-reversal symmetry
Authors:
Clara C. Wanjura,
Jesse J. Slim,
Javier del Pino,
Matteo Brunelli,
Ewold Verhagen,
Andreas Nunnenkamp
Abstract:
Nonreciprocity means that the transmission of a signal depends on its direction of propagation. Despite vastly different platforms and underlying working principles, the realisations of nonreciprocal transport in linear, time-independent systems rely on Aharonov-Bohm interference among several pathways and require breaking time-reversal symmetry. Here we extend the notion of nonreciprocity to unidirectional bosonic transport in systems with a time-reversal symmetric Hamiltonian by exploiting interference between beamsplitter (excitation preserving) and two-mode-squeezing (excitation non-preserving) interactions. In contrast to standard nonreciprocity, this unidirectional transport manifests when the mode quadratures are resolved with respect to an external reference phase. Hence we dub this phenomenon quadrature nonreciprocity. First, we experimentally demonstrate it in the minimal system of two coupled nanomechanical modes orchestrated by optomechanical interactions. Next, we develop a theoretical framework to characterise the class of networks exhibiting quadrature nonreciprocity based on features of their particle-hole graphs. In addition to unidirectionality, these networks can exhibit an even-odd pairing between collective quadratures, which we confirm experimentally in a four-mode system, and an exponential end-to-end gain in the case of arrays of cavities. Our work opens up new avenues for signal routing and quantum-limited amplification in bosonic systems.
Submitted 17 April, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
Unified Speech-Text Pre-training for Speech Translation and Recognition
Authors:
Yun Tang,
Hongyu Gong,
Ning Dong,
Changhan Wang,
Wei-Ning Hsu,
Jiatao Gu,
Alexei Baevski,
Xian Li,
Abdelrahman Mohamed,
Michael Auli,
Juan Pino
Abstract:
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask leverages unlabelled speech data, and a (self-)supervised text to text subtask makes use of abundant text training data. Two auxiliary supervised speech tasks are included to unify speech and text modeling space. Our contribution lies in integrating linguistic information from the text corpus into the speech pre-training. Detailed analysis reveals learning interference among subtasks. Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference. Our experiments show the proposed method can effectively fuse speech and text information into one model. It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset and comparable WERs to wav2vec 2.0 on the Librispeech speech recognition task.
Submitted 11 April, 2022;
originally announced April 2022.
-
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Authors:
Sravya Popuri,
Peng-Jen Chen,
Changhan Wang,
Juan Pino,
Yossi Adi,
Jiatao Gu,
Wei-Ning Hsu,
Ann Lee
Abstract:
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and transfer pre-training and efficient partial finetuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning with an average 6.6-12.1 BLEU gain, and it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html .
Submitted 13 September, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
HarmonicBalance.jl: A Julia suite for nonlinear dynamics using harmonic balance
Authors:
Jan Košata,
Javier del Pino,
Toni L. Heugel,
Oded Zilberberg
Abstract:
HarmonicBalance.jl is a publicly available Julia package designed to simplify and solve systems of periodic time-dependent nonlinear ordinary differential equations. Time dependence of the system parameters is treated with the harmonic balance method, which approximates the system's behaviour as a set of harmonic terms with slowly-varying amplitudes. Under this approximation, the set of all possible steady-state responses follows from the solution of a polynomial system. In HarmonicBalance.jl, we combine harmonic balance with contemporary implementations of symbolic algebra and the homotopy continuation method to numerically determine all steady-state solutions and their associated fluctuation dynamics. For the exploration of involved steady-state topologies, we provide a simple graphical user interface, allowing for arbitrary solution observables and phase diagrams. HarmonicBalance.jl is free software available at https://github.com/NonlinearOscillations/HarmonicBalance.jl.
Submitted 17 May, 2022; v1 submitted 1 February, 2022;
originally announced February 2022.
-
Textless Speech-to-Speech Translation on Real Data
Authors:
Ann Lee,
Hongyu Gong,
Paul-Ambroise Duquenne,
Holger Schwenk,
Peng-Jen Chen,
Changhan Wang,
Sravya Popuri,
Yossi Adi,
Juan Pino,
Jiatao Gu,
Wei-Ning Hsu
Abstract:
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs. Audio samples are available at https://facebookresearch.github.io/speech_translation/textless_s2st_real_data/index.html .
Submitted 4 May, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Authors:
Arun Babu,
Changhan Wang,
Andros Tjandra,
Kushal Lakhotia,
Qiantong Xu,
Naman Goyal,
Kritika Singh,
Patrick von Platen,
Yatharth Saraf,
Juan Pino,
Alexei Baevski,
Alexis Conneau,
Michael Auli
Abstract:
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
Submitted 16 December, 2021; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Non-Hermitian chiral phononics through optomechanically-induced squeezing
Authors:
Javier del Pino,
Jesse J. Slim,
Ewold Verhagen
Abstract:
Imposing chirality on a physical system engenders unconventional energy flow and responses, such as the Aharonov-Bohm effect and the topological quantum Hall phase for electrons in a symmetry-breaking magnetic field. Recently, great interest has arisen in combining that principle with broken Hermiticity to explore novel topological phases and applications. Here, we report unique phononic states formed when combining the controlled breaking of time-reversal symmetry with non-Hermitian dynamics, both induced through time-modulated radiation pressure forces in small nano-optomechanical networks. We observe chiral energy flow among mechanical resonators in a synthetic dimension and Aharonov-Bohm tuning of their hybridised modes. Introducing particle-non-conserving squeezing interactions, we discover a non-Hermitian Aharonov-Bohm effect in ring-shaped networks in which mechanical quasiparticles experience parametric gain. The resulting nontrivial complex mode spectra indicate flux-tuning of squeezing, exceptional points, instabilities and unidirectional phononic amplification. This rich new phenomenology points the way to the exploration of new non-Hermitian topological bosonic phases and applications in sensing and transport that exploit spatiotemporal symmetry breaking.
Submitted 27 October, 2021;
originally announced October 2021.
-
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
Authors:
Xutai Ma,
Hongyu Gong,
Danni Liu,
Ann Lee,
Yun Tang,
Peng-Jen Chen,
Wei-Ning Hsu,
Phillip Koehn,
Juan Pino
Abstract:
We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of translation is independent of intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations, learned in an unsupervised manner, is predicted by the model instead of continuous spectrogram features and passed directly to a vocoder for on-the-fly speech synthesis. We also introduce variational monotonic multihead attention (V-MMA) to handle the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies to compare the cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency.
Submitted 12 January, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation
Authors:
Danni Liu,
Changhan Wang,
Hongyu Gong,
Xutai Ma,
Yun Tang,
Juan Pino
Abstract:
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge of delivering S2ST in real time is the accumulated delay between the translation and speech synthesis modules. While incremental text-to-speech (iTTS) models have recently shown large quality improvements, they typically require additional future text inputs to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then present a simple yet effective duration-scaling approach for latency reduction. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.
Submitted 15 July, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
Authors:
Changhan Wang,
Wei-Ning Hsu,
Yossi Adi,
Adam Polyak,
Ann Lee,
Peng-Jen Chen,
Jiatao Gu,
Juan Pino
Abstract:
This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq S^2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.
Submitted 14 September, 2021;
originally announced September 2021.
-
Suppression of mid-circuit measurement crosstalk errors with micromotion
Authors:
J. P. Gaebler,
C. H. Baldwin,
S. A. Moses,
J. M. Dreiling,
C. Figgatt,
M. Foss-Feig,
D. Hayes,
J. M. Pino
Abstract:
Mid-circuit measurement and reset are crucial primitives in quantum computation, but such operations require strong interactions with selected qubits while maintaining isolation of neighboring qubits, which is a significant challenge in many systems. For trapped ion systems, measurement is performed with laser-induced fluorescence. Stray light from the detection beam and fluorescence from the measured ions can be significant sources of decoherence for unmeasured qubits. We present a technique using ion micromotion to reduce these sources of decoherence by over an order of magnitude. We benchmark the performance with a new method, based on randomized benchmarking, to estimate the magnitude of crosstalk errors on nearby qubits. Using the Honeywell System Model H0, we demonstrate measurement and reset on select qubits with low crosstalk errors on neighboring qubits.
Submitted 3 January, 2022; v1 submitted 24 August, 2021;
originally announced August 2021.
-
FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task
Authors:
Yun Tang,
Hongyu Gong,
Xian Li,
Changhan Wang,
Juan Pino,
Holger Schwenk,
Naman Goyal
Abstract:
In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded based approaches, by a large margin.
In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.
Submitted 14 August, 2021; v1 submitted 14 July, 2021;
originally announced July 2021.
-
Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task
Authors:
Yun Tang,
Juan Pino,
Xian Li,
Changhan Wang,
Dmitriy Genzel
Abstract:
Pretraining and multitask learning are widely used to improve the speech to text translation performance. In this study, we are interested in training a speech to text translation model along with an auxiliary text to text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities and preserve more information from the pretrained text translation modules. We observe minimal negative transfer effect between the two tasks and sharing more parameters is helpful to transfer knowledge from the text task to the speech task. The analysis also reveals that the modality representation difference at the top decoder layers is still not negligible, and those layers are critical for the translation quality. Inspired by these findings, we propose three methods to improve translation quality. First, a parameter sharing and initialization strategy is proposed to enhance information sharing between the tasks. Second, a novel attention-based regularization is proposed for the encoders and pulls the representations from different modalities closer. Third, an online knowledge distillation is proposed to enhance the knowledge transfer from the text to the speech task. Our experiments show that the proposed approach improves translation performance by more than 2 BLEU over a strong baseline and achieves state-of-the-art results on the MuST-C English-German, English-French and English-Spanish language pairs.
Submitted 12 July, 2021;
originally announced July 2021.
-
Direct speech-to-speech translation with discrete units
Authors:
Ann Lee,
Peng-Jen Chen,
Changhan Wang,
Jiatao Gu,
Sravya Popuri,
Xutai Ma,
Adam Polyak,
Yossi Adi,
Qing He,
Yun Tang,
Juan Pino,
Wei-Ning Hsu
Abstract:
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages. Audio samples are available at https://facebookresearch.github.io/speech_translation/direct_s2st_units/index.html .
Submitted 21 March, 2022; v1 submitted 12 July, 2021;
originally announced July 2021.
-
Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
Authors:
Hongyu Gong,
Yun Tang,
Juan Pino,
Xian Li
Abstract:
Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of +2.0 BLEU over 13 language directions in multilingual setting and +2.0 BLEU over 3 domains in multi-domain setting.
Submitted 21 June, 2021;
originally announced June 2021.