Skip to main content

Showing 1–16 of 16 results for author: Wakaki, H

.
  1. arXiv:2504.05154  [pdf, ps, other

    cs.CL

    CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness

    Authors: Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu

    Abstract: Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce \textbf{CA… ▽ More

    Submitted 4 June, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

    Comments: 27 pages

  2. arXiv:2503.20871  [pdf, other

    cs.CV cs.AI cs.CL

    VinaBench: Benchmark for Faithful and Consistent Visual Narratives

    Authors: Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut

    Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to addres… ▽ More

    Submitted 3 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

  3. arXiv:2503.11190  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Cross-Modal Learning for Music-to-Music-Video Description Generation

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri… ▽ More

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

  4. arXiv:2502.12623  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance… ▽ More

    Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  5. arXiv:2501.01123  [pdf, other

    cs.CL cs.AI cs.LG

    TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation

    Authors: Junya Ono, Hiromi Wakaki

    Abstract: Emotion recognition in conversation (ERC) has been attracting attention by methods for modeling multi-turn contexts. The multi-turn input to a pretraining model implicitly assumes that the current turn and other turns are distinguished during the training process by inserting special tokens into the input sequence. This paper proposes a priority-based attention method to distinguish each turn expl… ▽ More

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: past activity in 2021

  6. arXiv:2410.15573  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    OpenMU: Your Swiss Army Knife for Music Understanding

    Authors: Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our mus… ▽ More

    Submitted 27 November, 2024; v1 submitted 20 October, 2024; originally announced October 2024.

    Comments: Resources: https://github.com/sony/openmu

  7. arXiv:2410.08709  [pdf, other

    cs.LG math.NA stat.ML

    Distillation of Discrete Diffusion through Dimensional Correlations

    Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in image, sequential dependencies in la… ▽ More

    Submitted 8 May, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: 39 pages, ICML 2025 accepted

  8. arXiv:2410.06154  [pdf, other

    cs.CV

    GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

    Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass

    Abstract: In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task.… ▽ More

    Submitted 5 February, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: Code: https://github.com/jmiemirza/GLOV

  9. arXiv:2406.11228  [pdf, other

    cs.CL

    ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark

    Authors: Hiromi Wakaki, Yuki Mitsufuji, Yoshinori Maeda, Yukiko Nishimura, Silin Gao, Mengjie Zhao, Keiichi Yamada, Antoine Bosselut

    Abstract: We propose a new benchmark, ComperDial, which facilitates the training and evaluation of evaluation metrics for open-domain dialogue systems. ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents submitted to the Commonsense Persona-grounded Dialogue (CPD) challenge. As a result, for any dialogue, our benchmark includes mul… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  10. arXiv:2403.15737  [pdf, other

    cs.CL

    Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning

    Authors: Zhouhang Xie, Bodhisattwa Prasad Majumder, Mengjie Zhao, Yoshinori Maeda, Keiichi Yamada, Hiromi Wakaki, Julian McAuley

    Abstract: We consider the task of building a dialogue system that can motivate users to adopt positive lifestyle changes: Motivational Interviewing. Addressing such a task requires a system that can infer \textit{how} to motivate a user effectively. We propose DIIT, a framework that is capable of learning and applying conversation strategies in the form of natural language inductive rules from expert demons… ▽ More

    Submitted 23 March, 2024; originally announced March 2024.

  11. arXiv:2402.17011  [pdf, other

    cs.CL

    DiffuCOMET: Contextual Commonsense Knowledge Diffusion

    Authors: Silin Gao, Mete Ismayilzada, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

    Abstract: Inferring contextually-relevant and diverse commonsense to understand narratives remains challenging for knowledge models. In this work, we develop a series of knowledge models, DiffuCOMET, that leverage diffusion to learn to reconstruct the implicit semantic connections between narrative contexts and relevant commonsense knowledge. Across multiple diffusion steps, our method progressively refines… ▽ More

    Submitted 1 October, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  12. arXiv:2401.06742  [pdf, other

    cs.CL cs.AI

    Using Natural Language Inference to Improve Persona Extraction from Dialogue in a New Domain

    Authors: Alexandra DeLucia, Mengjie Zhao, Yoshinori Maeda, Makoto Yoda, Keiichi Yamada, Hiromi Wakaki

    Abstract: While valuable datasets such as PersonaChat provide a foundation for training persona-grounded dialogue agents, they lack diversity in conversational and narrative settings, primarily existing in the "real" world. To develop dialogue agents with unique personas, models are trained to converse given a specific persona, but hand-crafting these persona can be time-consuming, thus methods exist to aut… ▽ More

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: Code and models will be released upon publication

  13. arXiv:2310.13267  [pdf, other

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding… ▽ More

    Submitted 20 October, 2023; originally announced October 2023.

  14. arXiv:2310.01330  [pdf, other

    cs.CV

    Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

    Authors: Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Reporting bias arises when people assume that some knowledge is universally understood and hence, do not necessitate explicit elaboration. In this paper, we focus on the wide existence of reporting bias in visual-language datasets, embodied as the object-attribute association, which can subsequentially degrade models trained on them. To mitigate this bias, we propose a bimodal augmentation (BiAug)… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

  15. arXiv:2305.02364  [pdf, other

    cs.CL

    PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives

    Authors: Silin Gao, Beatriz Borges, Soyoung Oh, Deniz Bayazit, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

    Abstract: Sustaining coherent and engaging narratives requires dialogue or storytelling agents to understand how the personas of speakers or listeners ground the narrative. Specifically, these agents must infer personas of their listeners to produce statements that cater to their interests. They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their co… ▽ More

    Submitted 26 May, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: ACL 2023, long paper

  16. arXiv:2210.12678  [pdf, other

    cs.CL

    ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

    Authors: Silin Gao, Jena D. Hwang, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

    Abstract: Understanding rich narratives, such as dialogues and stories, often requires natural language processing systems to access relevant knowledge from commonsense knowledge graphs. However, these systems typically retrieve facts from KGs using simple heuristics that disregard the complex challenges of identifying situationally-relevant commonsense knowledge (e.g., contextualization, implicitness, ambi… ▽ More

    Submitted 23 October, 2022; originally announced October 2022.

    Comments: Findings of EMNLP 2022, long paper