-
Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis
Authors:
Samuel S. Sohn,
Sten Knutsen,
Karin Stromswold
Abstract:
Prosody plays a crucial role in speech perception, influencing both human understanding and automatic speech recognition (ASR) systems. Despite its importance, prosodic stress remains under-studied due to the challenge of efficiently analyzing it. This study explores fine-tuning OpenAI's Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. Using a dataset of 66 native English speakers, including male, female, neurotypical, and neurodivergent individuals, we assess the model's ability to generalize stress patterns and classify speakers by neurotype and gender based on brief speech samples. Our results highlight near-human accuracy in ASR performance across all three stress types and near-perfect precision in classifying gender and neurotype. By improving prosody-aware ASR, this work contributes to equitable and robust transcription technologies for diverse populations.
Submitted 3 March, 2025;
originally announced March 2025.
-
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Authors:
Danrui Li,
Sen Zhang,
Sam S. Sohn,
Kaidong Hu,
Muhammad Usman,
Mubbasir Kapadia
Abstract:
The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game designs, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated action-value functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers.
Submitted 10 February, 2025;
originally announced February 2025.
-
TrajDiffuse: A Conditional Diffusion Model for Environment-Aware Trajectory Prediction
Authors:
Qingze Liu,
Danrui Li,
Samuel S. Sohn,
Sejong Yoon,
Mubbasir Kapadia,
Vladimir Pavlovic
Abstract:
Accurate prediction of human or vehicle trajectories with good diversity that captures their stochastic nature is an essential task for many applications. However, many trajectory prediction models produce unreasonable trajectory samples that focus on improving diversity or accuracy while neglecting other key requirements, such as collision avoidance with the surrounding environment. In this work, we propose TrajDiffuse, a planning-based trajectory prediction method using a novel guided conditional diffusion model. We formulate the trajectory prediction problem as a denoising inpainting task and design a map-based guidance term for the diffusion process. TrajDiffuse is able to generate trajectory predictions that match or exceed the accuracy and diversity of the SOTA, while adhering almost perfectly to environmental constraints. We demonstrate the utility of our model through experiments on the nuScenes and PFSD datasets and provide an extensive benchmark analysis against the SOTA methods.
Submitted 14 October, 2024;
originally announced October 2024.
-
From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent
Authors:
Samuel S. Sohn,
Danrui Li,
Sen Zhang,
Che-Jui Chang,
Mubbasir Kapadia
Abstract:
Digital storytelling, essential in entertainment, education, and marketing, faces challenges in production scalability and flexibility. The StoryAgent framework, introduced in this paper, utilizes Large Language Models and generative tools to automate and refine digital storytelling. Employing a top-down story drafting and bottom-up asset generation approach, StoryAgent tackles key issues such as manual intervention, interactive scene orchestration, and narrative consistency. This framework enables efficient production of interactive and consistent narratives across multiple modalities, democratizing content creation and enhancing engagement. Our results demonstrate the framework's capability to produce coherent digital stories without reference videos, marking a significant advancement in automated digital storytelling.
Submitted 21 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature
Authors:
Gyeong Hoon Yi,
Jiwoo Choi,
Hyeongyun Song,
Olivia Miano,
Jaewoong Choi,
Kihoon Bang,
Byungju Lee,
Seok Su Sohn,
David Buttler,
Anna Hiszpanski,
Sang Soo Han,
Donghun Kim
Abstract:
Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extraction is an ineffective approach. To overcome this challenge, we present MaTableGPT, which is a GPT-based table data extractor from the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension, and filters hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the learning methods of zero-shot, few-shot and fine-tuning, we present a Pareto-front mapping where the few-shot learning method was found to be the most balanced solution owing to both its high extraction accuracy (total F1 score > 95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). The statistical analyses conducted on the database generated by MaTableGPT revealed valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.
Submitted 8 June, 2024;
originally announced June 2024.
-
The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents
Authors:
Che-Jui Chang,
Samuel S. Sohn,
Sen Zhang,
Rajath Jayashankar,
Muhammad Usman,
Mubbasir Kapadia
Abstract:
Previous studies regarding the perception of emotions for embodied virtual agents have shown the effectiveness of using virtual characters in conveying emotions through interactions with humans. However, creating an autonomous embodied conversational agent with expressive behaviors presents two major challenges. The first challenge is the difficulty of synthesizing conversational behaviors for each modality that are as expressive as real human behaviors. The second challenge is that the affects are modeled independently, which makes it difficult to generate multimodal responses with consistent emotions across all modalities. In this work, we propose a conceptual framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that aims to increase the perception of affects by generating multimodal behaviors conditioned on a consistent driving affect. We have conducted a user study with 199 participants to assess how the average person judges the affects perceived from multimodal behaviors that are consistent and inconsistent with respect to a driving affect. The results show that among all model conditions, our affect-consistent framework receives the highest Likert scores for the perception of driving affects. Our statistical analysis suggests that making a modality affect-inconsistent significantly decreases the perception of driving affects. We also observe that multimodal behaviors conditioned on consistent affects are more expressive compared to behaviors with inconsistent affects. Therefore, we conclude that multimodal emotion conditioning and affect consistency are vital to enhancing the perception of affects for embodied conversational agents.
Submitted 6 December, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
M3Act: Learning from Synthetic Human Group Activities
Authors:
Che-Jui Chang,
Danrui Li,
Deep Patel,
Parth Goel,
Honglu Zhou,
Seonghyeon Moon,
Samuel S. Sohn,
Sejong Yoon,
Vladimir Pavlovic,
Mubbasir Kapadia
Abstract:
The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address this limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on the DanceTrack dataset, leading to a jump on the leaderboard from 10th to 2nd place. Moreover, M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act.
Submitted 2 May, 2024; v1 submitted 29 June, 2023;
originally announced June 2023.
-
MSI: Maximize Support-Set Information for Few-Shot Segmentation
Authors:
Seonghyeon Moon,
Samuel S. Sohn,
Honglu Zhou,
Sejong Yoon,
Vladimir Pavlovic,
Muhammad Haris Khan,
Mubbasir Kapadia
Abstract:
Few-shot segmentation (FSS) aims to segment a target class using a small number of labeled images (the support set). To extract information relevant to the target class, a dominant approach in best-performing FSS methods removes background features using a support mask. We observe that this feature excision through a limiting support mask introduces an information bottleneck in several challenging FSS cases, e.g., for small targets and/or inaccurate target boundaries. To this end, we present a novel method (MSI), which maximizes the support-set information by exploiting two complementary sources of features to generate super correlation maps. We validate the effectiveness of our approach by instantiating it into three recent and strong FSS methods. Experimental results on several publicly available FSS benchmarks show that our proposed method consistently improves performance by visible margins and leads to faster convergence. Our code and trained models are available at: https://github.com/moonsh/MSI-Maximize-Support-Set-Information
Submitted 10 November, 2023; v1 submitted 9 December, 2022;
originally announced December 2022.
-
An Information-Theoretic Approach for Estimating Scenario Generalization in Crowd Motion Prediction
Authors:
Gang Qiao,
Kaidong Hu,
Seonghyeon Moon,
Samuel S. Sohn,
Sejong Yoon,
Mubbasir Kapadia,
Vladimir Pavlovic
Abstract:
Learning-based approaches to modeling crowd motion have become increasingly successful but require training and evaluation on large datasets, coupled with complex model selection and parameter tuning. To circumvent this tremendously time-consuming process, we propose a novel scoring method, which characterizes generalization of models trained on source crowd scenarios and applied to target crowd scenarios using a training-free, model-agnostic Interaction + Diversity Quantification score, ISDQ. The Interaction component aims to characterize the difficulty of scenario domains, while the diversity of a scenario domain is captured in the Diversity score. Both scores can be computed in a computationally tractable manner. Our experimental results validate the efficacy of the proposed method on several simulated and real-world (source, target) generalization tasks, demonstrating its potential to select optimal domain pairs before training and testing a model.
Submitted 1 November, 2022;
originally announced November 2022.
-
Predicting failure characteristics of structural materials via deep learning based on nondestructive void topology
Authors:
Leslie Ching Ow Tiong,
Gunjick Lee,
Seok Su Sohn,
Donghun Kim
Abstract:
Accurate prediction of the failure progression of structural materials is critical for preventing failure-induced accidents. Despite considerable mechanics modeling-based efforts, accurate prediction remains a challenging task in real-world environments due to unexpected damage factors and defect evolutions. Here, we report a novel method for predicting material failure characteristics that uniquely combines nondestructive X-ray computed tomography (X-CT), persistent homology (PH), and deep multimodal learning (DML). The combined method takes the microstructural defect state at the time of material examination as input and outputs the failure-related properties. Our method is demonstrated to be effective using two types of fracture datasets (tensile and fatigue datasets) with ferritic low alloy steel as a representative structural material. The method achieves a mean absolute error (MAE) of 0.09 in predicting the local strain with the tensile dataset and an MAE of 0.14 in predicting the fracture progress with the fatigue dataset. These high accuracies are mainly due to PH processing of the X-CT images, which transforms complex and noisy three-dimensional X-CT images into compact two-dimensional persistence diagrams that preserve key topological features such as the internal void size, density, and distribution. The combined PH and DML processing of 3D X-CT data is our unique approach enabling reliable failure predictions at the time of material examination based on void topology progressions, and the method can be extended to various nondestructive failure tests for practical use.
Submitted 17 May, 2022;
originally announced May 2022.
-
HM: Hybrid Masking for Few-Shot Segmentation
Authors:
Seonghyeon Moon,
Samuel S. Sohn,
Honglu Zhou,
Sejong Yoon,
Vladimir Pavlovic,
Muhammad Haris Khan,
Mubbasir Kapadia
Abstract:
We study few-shot semantic segmentation that aims to segment a target object from a query image when provided with a few annotated support images of the target class. Several recent methods resort to a feature masking (FM) technique to discard irrelevant feature activations, which eventually facilitates the reliable prediction of the segmentation mask. A fundamental limitation of FM is the inability to preserve the fine-grained spatial details that affect the accuracy of the segmentation mask, especially for small target objects. In this paper, we develop a simple, effective, and efficient approach to enhance feature masking (FM). We dub the enhanced FM hybrid masking (HM). Specifically, we compensate for the loss of fine-grained spatial details in the FM technique by investigating and leveraging a complementary basic input masking method. Experiments have been conducted on three publicly available benchmarks with strong few-shot segmentation (FSS) baselines. We empirically show improved performance against the current state-of-the-art methods by visible margins across different benchmarks. Our code and trained models are available at: https://github.com/moonsh/HM-Hybrid-Masking
Submitted 24 July, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction
Authors:
Mihee Lee,
Samuel S. Sohn,
Seonghyeon Moon,
Sejong Yoon,
Mubbasir Kapadia,
Vladimir Pavlovic
Abstract:
Accurate long-term trajectory prediction in complex scenes, where multiple agents (e.g., pedestrians or vehicles) interact with each other and the environment while attempting to accomplish diverse and often unknown goals, is a challenging stochastic forecasting problem. In this work, we propose MUSE, a new probabilistic modeling framework based on a cascade of Conditional VAEs, which tackles the long-term, uncertain trajectory prediction task using a coarse-to-fine multi-factor forecasting architecture. In its Macro stage, the model learns a joint pixel-space representation of two key factors, the underlying environment and the agent movements, to predict the long and short-term motion goals. Conditioned on them, the Micro stage learns a fine-grained spatio-temporal representation for the prediction of individual agent trajectories. The VAE backbones across the two stages make it possible to naturally account for the joint uncertainty at both levels of granularity. As a result, MUSE offers diverse and simultaneously more accurate predictions compared to the current state-of-the-art. We demonstrate these assertions through a comprehensive set of experiments on nuScenes and SDD benchmarks as well as PFSD, a new synthetic dataset, which challenges the forecasting ability of models on complex agent-environment interaction scenarios.
Submitted 18 January, 2022;
originally announced January 2022.
-
D-HYPR: Harnessing Neighborhood Modeling and Asymmetry Preservation for Digraph Representation Learning
Authors:
Honglu Zhou,
Advith Chegu,
Samuel S. Sohn,
Zuohui Fu,
Gerard de Melo,
Mubbasir Kapadia
Abstract:
Digraph Representation Learning (DRL) aims to learn representations for directed homogeneous graphs (digraphs). Prior work in DRL is largely constrained (e.g., limited to directed acyclic graphs), or has poor generalizability across tasks (e.g., evaluated solely on one task). Most Graph Neural Networks (GNNs) exhibit poor performance on digraphs due to the neglect of modeling neighborhoods and preserving asymmetry. In this paper, we address these notable challenges by leveraging hyperbolic collaborative learning from multi-ordered and partitioned neighborhoods, and regularizers inspired by socio-psychological factors. Our resulting formalism, Digraph Hyperbolic Networks (D-HYPR) - albeit conceptually simple - generalizes to digraphs where cycles and non-transitive relations are common, and is applicable to multiple downstream tasks including node classification, link presence prediction, and link property prediction. In order to assess the effectiveness of D-HYPR, extensive evaluations were performed across 8 real-world digraph datasets involving 21 prior techniques. D-HYPR statistically significantly outperforms the current state of the art. We release our code at https://github.com/hongluzhou/dhypr
Submitted 28 September, 2022; v1 submitted 22 December, 2021;
originally announced December 2021.
-
Deep Crowd-Flow Prediction in Built Environments
Authors:
Samuel S. Sohn,
Seonghyeon Moon,
Honglu Zhou,
Sejong Yoon,
Vladimir Pavlovic,
Mubbasir Kapadia
Abstract:
Predicting the behavior of crowds in complex environments is a key requirement in a multitude of application areas, including crowd and disaster management, architectural design, and urban planning. Given a crowd's immediate state, current approaches simulate crowd movement to arrive at a future state. However, most applications require the ability to predict hundreds of possible simulation outcomes (e.g., under different environment and crowd situations) at real-time rates, for which these approaches are prohibitively expensive.
In this paper, we propose an approach to instantly predict the long-term flow of crowds in arbitrarily large, realistic environments. Central to our approach is a novel CAGE representation consisting of Capacity, Agent, Goal, and Environment-oriented information, which efficiently encodes and decodes crowd scenarios into compact, fixed-size representations that are environmentally lossless. We present a framework to facilitate the accurate and efficient prediction of crowd flow in never-before-seen crowd scenarios. We conduct a series of experiments to evaluate the efficacy of our approach and showcase positive results.
Submitted 13 October, 2019;
originally announced October 2019.
-
Cognitive Agent Based Simulation Model For Improving Disaster Response Procedures
Authors:
Rohit K. Dubey,
Samuel S. Sohn,
Christoph Hoelscher,
Mubbasir Kapadia
Abstract:
In the event of a disaster, saving human lives is of utmost importance. For developing proper evacuation procedures and guidance systems, behavioural data on how people respond during panic and stress is crucial. In the absence of real human data on building evacuation, there is a need for a crowd simulator to model egress and decision-making under uncertainty. In this paper, we propose an agent-based simulation tool, which is grounded in human cognition and decision-making, for evaluating and improving the effectiveness of building evacuation procedures and guidance systems during a disaster. Specifically, we propose a predictive agent-wayfinding framework based on information theory that is applied at intersections with variable route choices where it fuses N dynamic information sources. The proposed framework can be used to visualize trajectories and prediction results (i.e., total evacuation time, number of people evacuated) for different combinations of reinforcing or contradicting information sources (i.e., signage, crowd flow, familiarity, and spatial layout). This tool can enable designers to recreate various disaster scenarios and generate simulation data for improving the evacuation procedures and existing guidance systems.
Submitted 1 October, 2019;
originally announced October 2019.