-
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities
Authors:
George Saon,
Avihu Dekel,
Alexander Brooks,
Tohru Nagano,
Abraham Daniels,
Aharon Satt,
Ashish Mittal,
Brian Kingsbury,
David Haws,
Edmilson Morais,
Gakuto Kurata,
Hagai Aronowitz,
Ibrahim Ibrahim,
Jeff Kuo,
Kate Soule,
Luis Lastras,
Masayuki Suzuki,
Ron Hoory,
Samuel Thomas,
Sashi Novitasari,
Takashi Fukuda,
Vishal Sunder,
Xiaodong Cui,
Zvi Kons
Abstract:
Granite-speech LLMs are compact and efficient speech language models specifically designed for English ASR and automatic speech translation (AST). The models were trained by modality aligning the 2B and 8B parameter variants of granite-3.3-instruct to speech on publicly available open-source corpora containing audio inputs and text targets consisting of either human transcripts for ASR or automatically generated translations for AST. Comprehensive benchmarking shows that on English ASR, which was our primary focus, they outperform several competitors' models that were trained on orders of magnitude more proprietary data, and they keep pace on English-to-X AST for major European languages, Japanese, and Chinese. The speech-specific components are: a conformer acoustic encoder using block attention and self-conditioning trained with connectionist temporal classification, a windowed query-transformer speech modality adapter used to do temporal downsampling of the acoustic embeddings and map them to the LLM text embedding space, and LoRA adapters to further fine-tune the text LLM. Granite-speech-3.3 operates in two modes: in speech mode, it performs ASR and AST by activating the encoder, projector, and LoRA adapters; in text mode, it calls the underlying granite-3.3-instruct model directly (without LoRA), essentially preserving all the text LLM capabilities and safety. Both models are freely available on HuggingFace (https://huggingface.co/ibm-granite/granite-speech-3.3-2b and https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and can be used for both research and commercial purposes under a permissive Apache 2.0 license.
Submitted 13 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Extending RNN-T-based speech recognition systems with emotion and language classification
Authors:
Zvi Kons,
Hagai Aronowitz,
Edmilson Morais,
Matheus Damasceno,
Hong-Kwang Kuo,
Samuel Thomas,
George Saon
Abstract:
Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition. Our work extends the STT system for emotion classification through minimal changes, and shows successful results on the IEMOCAP and MELD datasets. In addition, we demonstrate that by adding a lightweight component to the RNN-T module, it can also be used for language identification. In our evaluations, this new classifier demonstrates state-of-the-art accuracy for the NIST-LRE-07 dataset.
Submitted 28 July, 2022;
originally announced July 2022.
-
Towards a Common Speech Analysis Engine
Authors:
Hagai Aronowitz,
Itai Gat,
Edmilson Morais,
Weizhong Zhu,
Ron Hoory
Abstract:
Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an engine should be able to handle multiple speech processing tasks, using a single architecture, to obtain state-of-the-art accuracy. The engine must also enable support for new tasks with small training datasets. Beyond that, a common engine should be capable of supporting distributed training with client in-house private data. We present the architecture for a common speech analysis engine based on the HuBERT self-supervised speech representation. Based on experiments, we report our results for language identification and emotion recognition on the standard evaluations NIST-LRE 07 and IEMOCAP. Our results surpass the state-of-the-art performance reported so far on these tasks. We also analyzed our engine on the emotion recognition task using reduced amounts of training data and show how to achieve improved results.
Submitted 1 March, 2022;
originally announced March 2022.
-
Speech Emotion Recognition using Self-Supervised Features
Authors:
Edmilson Morais,
Ron Hoory,
Weizhong Zhu,
Itai Gat,
Matheus Damasceno,
Hagai Aronowitz
Abstract:
Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal, speech-only system not only achieves SOTA results, but also sheds light on the possibility that powerful, well fine-tuned self-supervised acoustic features can reach results similar to those achieved by SOTA multimodal systems using both Speech and Text modalities.
Submitted 6 February, 2022;
originally announced February 2022.
-
Speaker Normalization for Self-supervised Speech Emotion Recognition
Authors:
Itai Gat,
Hagai Aronowitz,
Weizhong Zhu,
Edmilson Morais,
Ron Hoory
Abstract:
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
Submitted 6 November, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
LyST: a Scalar-Tensor Theory of Gravity on Lyra Manifold
Authors:
R. R. Cuzinatto,
E. M. de Morais,
B. M. Pimentel
Abstract:
We present a scalar-tensor theory of gravity on a torsion-free and metric compatible Lyra manifold. This is obtained by generalizing the concept of physical reference frame by considering a scale function defined over the manifold. The choice of a specific frame induces a local base, naturally non-holonomic, whose structure constants give rise to extra terms in the expression of the connection coefficients and in the expression for the covariant derivative. In the Lyra manifold, transformations between reference frames involving both coordinates and scale change the transformation law of tensor fields, when compared to those of the Riemann manifold. From a direct generalization of the Einstein-Hilbert minimal action coupled with a matter term, it was possible to build a Lyra invariant action, which gives rise to the associated Lyra Scalar-Tensor theory of gravity (LyST), with field equations for $g_{μν}$ and $φ$. These equations have a well-defined Newtonian limit, from which it can be seen that both metric and scale play a role in the description of the gravitational interaction. We present a spherically symmetric solution for the LyST gravity field equations. It depends on two parameters $m$ and $r_{L}$, whose physical meaning is carefully investigated. We highlight the properties of the LyST spherically symmetric line element and compare it to the Schwarzschild solution.
Submitted 13 April, 2021;
originally announced April 2021.
-
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Authors:
Sujeong Cha,
Wangrui Hou,
Hyun Jung,
My Phung,
Michael Picheny,
Hong-Kwang Kuo,
Samuel Thomas,
Edmilson Morais
Abstract:
A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust pre-trained BERT model and adopt a cross-modal system that co-trains text embeddings and acoustic embeddings in a shared latent space. We further enhance this system by utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the text module on our target datasets. Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance on Snips SLU and Fluent Speech Commands datasets.
Submitted 14 June, 2021; v1 submitted 7 April, 2021;
originally announced April 2021.
-
End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features
Authors:
Edmilson Morais,
Hong-Kwang J. Kuo,
Samuel Thomas,
Zoltan Tuske,
Brian Kingsbury
Abstract:
Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU architecture based on transformer networks, which allows the use of self-supervised pre-trained acoustic features, pre-trained model initialization and multi-task training. Several SLU experiments for predicting intent and entity labels/values using the ATIS dataset are performed. These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task training, they almost eliminate the necessity of pre-trained model initialization.
Submitted 16 November, 2020;
originally announced November 2020.
-
A Survey on Zero Knowledge Range Proofs and Applications
Authors:
Eduardo Morais,
Tommy Koens,
Cees van Wijk,
Aleksei Koren
Abstract:
In recent years, there has been an increasing effort to leverage Distributed Ledger Technology (DLT), including blockchain. One of the main topics of interest, given its importance, is the research and development of privacy mechanisms such as Zero Knowledge Proofs (ZKP). ZKP is a cryptographic technique that can be used to hide information that is put into the ledger, while still allowing this data to be validated. In this work we describe different strategies to construct Zero Knowledge Range Proofs (ZKRP), such as the scheme proposed by Boudot in 2001, the one proposed in 2008 by Camenisch et al., and Bulletproofs, proposed in 2017. We also compare these strategies and discuss possible use cases. Since Bulletproofs is the most efficient construction, we will give a detailed description of its algorithms and optimizations. Bulletproofs is not only more efficient than previous schemes, but also avoids the trusted setup, which is a requirement that is not desirable in the context of Distributed Ledger Technology (DLT) and blockchain. In the case of cryptocurrencies, if the setup phase is compromised, it would be possible to generate money out of thin air. Interestingly, Bulletproofs can also be used to construct generic Zero Knowledge Proofs (ZKP), in the sense that it can be used to prove generic statements, and thus it is not restricted to ZKRP but can be used for any kind of Proof of Knowledge (PoK). Hence Bulletproofs leads to a more powerful tool to provide privacy for DLT. Here we describe in detail the algorithms involved in the Bulletproofs protocol for ZKRP. We also present our implementation, which was open sourced.
Submitted 15 July, 2019;
originally announced July 2019.
-
Analytic Study of Cosmological Perturbations in a Unified Model of Dark Matter and Dark Energy with a Sharp Transition
Authors:
Rodrigo R. Cuzinatto,
Léo G. Medeiros,
Eduardo M. de Morais,
Robert H. Brandenberger
Abstract:
We study cosmological perturbations in a model of unified dark matter and dark energy with a sharp transition in the late-time universe. The dark sector is described by a dark fluid which evolves from an early stage at redshifts $z > z_C$ when it behaves as cold dark matter (CDM) to a late time dark energy (DE) phase ($z < z_C$) when the equation of state parameter is $w = -1 + ε$, with a constant $ε$ which must be in the range $0 < ε < 2/3$. We show that fluctuations in the dark energy phase suffer from an exponential instability, the mode functions growing both as a function of comoving momentum $k$ and of conformal time $η$. In order that this exponential instability does not lead to distortions of the energy density power spectrum on scales for which we have good observational results, the redshift $z_C$ of transition between the two phases is constrained to be so close to zero that the model is unable to explain the supernova data.
Submitted 3 September, 2018; v1 submitted 4 February, 2018;
originally announced February 2018.
-
de Broglie-Proca and Bopp-Podolsky massive photon gases in cosmology
Authors:
R. R. Cuzinatto,
E. M. de Morais,
L. G. Medeiros,
C. Naldoni de Souza,
B. M. Pimentel
Abstract:
We investigate the influence of massive photons on the evolution of the expanding universe. Two particular models for generalized electrodynamics are considered, namely de Broglie-Proca and Bopp-Podolsky electrodynamics. We obtain the equation of state (EOS) $P=P(\varepsilon)$ for each case using dispersion relations derived from both theories. The EOS are inputted into the Friedmann equations of a homogeneous and isotropic space-time to determine the cosmic scale factor $a(t)$. It is shown that the photon non-null mass does not significantly alter the result $a\propto t^{1/2}$ valid for a massless photon gas; this is true either in de Broglie-Proca's case (where the photon mass $m$ is extremely small) or in Bopp-Podolsky theory (for which $m$ is extremely large).
Submitted 8 June, 2017; v1 submitted 3 November, 2016;
originally announced November 2016.
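As a point of reference for the massless baseline quoted in the abstract above, the following sketch recovers $a\propto t^{1/2}$ for an ordinary photon gas. It assumes a spatially flat Friedmann model and the standard radiation EOS $P=\varepsilon/3$; the flatness assumption and the symbols $H$, $G$ and $c$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{align}
  \dot{\varepsilon} + 3H(\varepsilon + P) &= 0,
    \qquad H \equiv \frac{\dot{a}}{a}
    && \text{(energy conservation)} \\
  P = \tfrac{1}{3}\varepsilon
    \;&\Rightarrow\; \frac{\dot{\varepsilon}}{\varepsilon} = -4\,\frac{\dot{a}}{a}
    \;\Rightarrow\; \varepsilon \propto a^{-4} \\
  H^{2} = \frac{8\pi G}{3c^{2}}\,\varepsilon
    \;&\Rightarrow\; \dot{a} \propto a^{-1}
    \;\Rightarrow\; a\,\mathrm{d}a \propto \mathrm{d}t
    \;\Rightarrow\; a \propto t^{1/2}
    && \text{(flat Friedmann equation)}
\end{align}
```

The abstract's claim is that replacing $P=\varepsilon/3$ with the de Broglie-Proca or Bopp-Podolsky EOS does not significantly change this scaling.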
-
Tuning the pn junction at a metal-graphene interface via H2 exposure
Authors:
Alisson R. Cadore,
Edrian Mania,
Evandro A. Morais,
Kenji Watanabe,
Takashi Taniguchi,
Rodrigo G. Lacerda,
Leonardo C. Campos
Abstract:
Combining experiment and theory, we investigate how the naturally created heterojunction at a graphene and metallic contact is modulated via interaction with molecular hydrogen (H2). Due to electrostatic interaction, a Cr/Au electrode induces a pn junction in graphene, leading to an asymmetrical resistance between the charge carriers (electron and hole). This asymmetry is well modeled by considering the preferential charge scattering at the pn junction, and we show that it can be modulated in a reversible, selective and asymmetrical manner by exposing the metal-graphene interface to H2. Our results are valuable for understanding the nature of the metal-graphene interfaces and demonstrate a novel route towards hydrogen sensor applications. KEYWORDS: graphene, contact resistance,
Submitted 30 April, 2017; v1 submitted 15 March, 2016;
originally announced March 2016.
-
Observational constraints to a unified cosmological model
Authors:
R. R. Cuzinatto,
L. G. Medeiros,
E. M. de Morais
Abstract:
We propose a phenomenological unified model for dark matter and dark energy based on an equation of state parameter $w$ that scales with the $\arctan$ of the redshift. The free parameters of the model are three constants: $Ω_{b0}$, $α$ and $β$. Parameter $α$ dictates the transition rate between the matter dominated era and the accelerated expansion period. The ratio $β/α$ gives the redshift of the equivalence between both regimes. Cosmological parameters are fixed by observational data from Primordial Nucleosynthesis (PN), Supernovae of the type Ia (SNIa), Gamma-Ray Bursts (GRB) and Baryon Acoustic Oscillations (BAO). The calibration of the 138 GRB events is performed using the 580 SNIa of the Union2.1 data set and a new set of 79 high-redshift GRBs is obtained. The various sets of data are used in different combinations to constrain the parameters through statistical analysis. The unified model is compared to the $Λ$CDM model and their differences are emphasized.
Submitted 27 August, 2015; v1 submitted 29 November, 2014;
originally announced December 2014.
-
Kronecker's and Newton's approaches to solving: A first comparison
Authors:
D. Castro,
K. Haegele,
J. E. Morais,
L. M. Pardo
Abstract:
In this extended abstract we deal with the relations between the numerical/diophantine approximation and the symbolic/algebraic geometry approaches to solving multivariate diophantine polynomial systems, obtaining several consequences ranging from diophantine approximation to effective number theory.
Submitted 16 August, 1999;
originally announced August 1999.
-
Straight--Line Programs in Geometric Elimination Theory
Authors:
M. Giusti,
J. Heintz,
J. E. Morais,
J. Morgenstern,
L. M. Pardo
Abstract:
We present a new method for solving symbolically zero-dimensional polynomial equation systems in the affine and toric case. The main feature of our method is the use of problem-adapted data structures: arithmetic networks and straight-line programs. For sequential time complexity measured by network size we obtain the following result: it is possible to solve any affine or toric zero-dimensional equation system in non-uniform sequential time which is polynomial in the length of the input description and the "geometric degree" of the equation system. Here, the input is thought to be given by a straight-line program (or alternatively in sparse representation), and the length of the input is measured by number of variables, degree of equations and size of the program (or sparsity of the equations). The geometric degree of the input system has to be adequately defined. It is always bounded by the algebraic-combinatoric "Bézout number" of the system which is given by the Hilbert function of a suitable homogeneous ideal. However, in many important cases, the value of the geometric degree of the system is much smaller than its Bézout number since this geometric degree does not take into account multiplicities or degrees of extraneous components (which may appear at infinity in the affine case or may be contained in some coordinate hyperplane in the toric case). Our method contains a new application of a classic tool to symbolic computation: we use Newton iteration in order to simplify straight-line programs occurring in elimination procedures. Our new technique allows for practical implementations a meaningful characterization of the intrinsic algebraic complexity of typical elimination problems and reduces the still unanswered question of their intrinsic bit complexity to
Submitted 5 September, 1996;
originally announced September 1996.
-
Lower Bounds for diophantine Approximation
Authors:
M. Giusti,
J. Heintz,
K. Hägele,
J. E. Morais,
L. M. Pardo,
J. L. Montaña
Abstract:
We introduce a subexponential algorithm for geometric solving of multivariate polynomial equation systems whose bit complexity depends mainly on intrinsic geometric invariants of the solution set. From this algorithm, we derive a new procedure for the decision of consistency of polynomial equation systems whose bit complexity is subexponential, too. As a byproduct, we analyze the division of a polynomial modulo a reduced complete intersection ideal and from this, we obtain an intrinsic lower bound for the logarithmic height of diophantine approximations to a given solution of a zero-dimensional polynomial equation system. This result represents a multivariate version of Liouville's classical theorem on approximation of algebraic numbers by rationals. A special feature of our procedures is their polynomial character with respect to the mentioned geometric invariants when instead of bit operations only arithmetic operations are counted at unit cost. Technically our paper relies on the use of straight-line programs as a data structure for the encoding of polynomials, on a new symbolic application of Newton's algorithm to the Implicit Function Theorem and on a special, basis-independent trace formula for affine Gorenstein algebras.
Submitted 13 August, 1996;
originally announced August 1996.