Showing 1–11 of 11 results for author: Goswami, N

  1. arXiv:2506.12103  [pdf, other]

    cs.AI cs.CY cs.LG

    The Amazon Nova Family of Models: Technical Report and Model Card

    Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, et al. (761 additional authors not shown)

    Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…

    Submitted 17 March, 2025; originally announced June 2025.

    Comments: 48 pages, 10 figures

    Report number: 20250317

  2. arXiv:2506.00809  [pdf, ps, other]

    cs.SD eess.AS

    FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge

    Authors: Nabarun Goswami, Tatsuya Harada

    Abstract: We propose a multi-stage framework for universal speech enhancement, designed for the Interspeech 2025 URGENT Challenge. Our system first employs a Sparse Compression Network to robustly separate sources and extract an initial clean speech estimate from noisy inputs. This is followed by an efficient generative model that refines speech quality by leveraging self-supervised features and optimizing…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  3. arXiv:2502.20323  [pdf, other]

    cs.CV

    ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

    Authors: Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada

    Abstract: Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synch…

    Submitted 28 February, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

    Comments: More video demonstrations, code, models and data can be found on our project website: http://xg-chu.site/project_artalk/

  4. arXiv:2403.13015  [pdf]

    eess.IV cs.LG

    HyperVQ: MLR-based Vector Quantization in Hyperbolic Space

    Authors: Nabarun Goswami, Yusuke Mukuta, Tatsuya Harada

    Abstract: The success of models operating on tokenized data has heightened the need for effective tokenization methods, particularly in vision and auditory tasks where inputs are naturally continuous. A common solution is to employ Vector Quantization (VQ) within VQ Variational Autoencoders (VQVAEs), transforming inputs into discrete tokens by clustering embeddings in Euclidean space. However, Euclidean emb…

    Submitted 6 April, 2025; v1 submitted 17 March, 2024; originally announced March 2024.

  5. arXiv:2401.10005  [pdf, other]

    cs.CV cs.CL

    Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

    Authors: Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada

    Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We…

    Submitted 17 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

  6. arXiv:2308.06979  [pdf, other]

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Music Demixing Track

    Authors: Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, et al. (2 additional authors not shown)

    Abstract: This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce t…

    Submitted 19 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Published in Transactions of the International Society for Music Information Retrieval (https://transactions.ismir.net/articles/10.5334/tismir.171)

    Journal ref: Transactions of the International Society for Music Information Retrieval, 7(1), pp.63-84, 2024

  7. arXiv:2207.06011  [pdf, other]

    eess.AS cs.SD

    SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate

    Authors: Nabarun Goswami, Tatsuya Harada

    Abstract: The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently based on context, and phonemes can vary depending on various physiological and stylistic factors like gender, age, accent, emotions, etc. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesize…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022. Visit https://naba89.github.io/SATTS-demo/ for a demo

  8. arXiv:2011.02368  [pdf]

    cs.DC cs.AR cs.GR

    An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels

    Authors: Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li

    Abstract: Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. Realization of exascale computing using accelerators requires further improvements in power efficiency. With hardwired kernel concurrency enablement in accelerators, inter- and intra-workload simultaneous kernels computation p…

    Submitted 4 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

  9. arXiv:1904.03065  [pdf, other]

    cs.SD eess.AS

    Recursive speech separation for unknown number of speakers

    Authors: Naoya Takahashi, Sudarsanam Parthasaarathy, Nabarun Goswami, Yuki Mitsufuji

    Abstract: In this paper we propose a method of single-channel speaker-independent multi-speaker speech separation for an unknown number of speakers. As opposed to previous works, in which the number of speakers is assumed to be known in advance and speech separation models are specific for the number of speakers, our proposed method can be applied to cases with different numbers of speakers using a single m…

    Submitted 1 September, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019 (oral)

  10. arXiv:1805.02410  [pdf, other]

    cs.SD eess.AS

    MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation

    Authors: Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji

    Abstract: Deep neural networks have become an indispensable technique for audio source separation (ASS). It was recently reported that a variant of CNN architecture called MMDenseNet was successfully employed to solve the ASS problem of estimating source amplitudes, and state-of-the-art results were obtained for DSD100 dataset. To further enhance MMDenseNet, here we propose a novel architecture that integra…

    Submitted 29 May, 2018; v1 submitted 7 May, 2018; originally announced May 2018.

  11. arXiv:1705.04111  [pdf, other]

    cs.DM

    Critical Graphs for Minimum Vertex Cover

    Authors: Andreas Jakoby, Naveen Kumar Goswami, Eik List, Stefan Lucks

    Abstract: In the context of the chromatic-number problem, a critical graph is an instance where the deletion of any element would decrease the graph's chromatic number. Such instances have been shown to be interesting objects of study for deepening the understanding of the optimization problem. This work introduces critical graphs in the context of Minimum Vertex Cover. We demonstrate their potential for the generati…

    Submitted 12 July, 2017; v1 submitted 11 May, 2017; originally announced May 2017.

    ACM Class: F.2.2; G.2.2