Showing 1–2 of 2 results for author: Voas, J

Search v0.5.6 released 2020-02-24

arXiv:2406.06438 [pdf, other]

cs.CL cs.CV cs.HC cs.LG cs.SD eess.AS

Multimodal Contextualized Semantic Parsing from Speech

Authors: Jordan Voas, Raymond Mooney, David Harwath

Abstract: We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 10 pages, 3 figures, ACL 2024 Main

arXiv:2309.10248 [pdf, other]

cs.CL cs.GR cs.LG

doi 10.1145/3588432.3591550

What is the Best Automated Metric for Text to Motion Generation?

Authors: Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney

Abstract: There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.

Submitted 18 September, 2023; originally announced September 2023.

Comments: 8 pages, SIGGRAPH Asia 2023 Conference
