-
Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
Authors:
Ramesh Manuvinakurike,
Emanuel Moss,
Elizabeth Anne Watkins,
Saurav Sahay,
Giuseppe Raffa,
Lama Nachman
Abstract:
Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in actionable ways. Agentic pipelines consist of multiple LLMs working in cooperation with minimal human control. In this research paper, we present early findings from an agentic pipeline implementation…
▽ More
Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in actionable ways. Agentic pipelines consist of multiple LLMs working in cooperation with minimal human control. In this research paper, we present early findings from an agentic pipeline implementation of a perceptive task guidance system. Through quantitative and qualitative analysis, we analyze how Chain-of-Thought (CoT) reasoning, a common vehicle for explainability in LLMs, operates within agentic pipelines. We demonstrate that CoT reasoning alone does not lead to better outputs, nor does it offer explainability, as it tends to produce explanations without explainability, in that they do not improve the ability of end users to better understand systems or achieve their goals.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
What's So Human about Human-AI Collaboration, Anyway? Generative AI and Human-Computer Interaction
Authors:
Elizabeth Anne Watkins,
Emanuel Moss,
Giuseppe Raffa,
Lama Nachman
Abstract:
While human-AI collaboration has been a longstanding goal and topic of study for computational research, the emergence of increasingly naturalistic generative AI language models has greatly inflected the trajectory of such research. In this paper we identify how, given the language capabilities of generative AI, common features of human-human collaboration derived from the social sciences can be a…
▽ More
While human-AI collaboration has been a longstanding goal and topic of study for computational research, the emergence of increasingly naturalistic generative AI language models has greatly inflected the trajectory of such research. In this paper we identify how, given the language capabilities of generative AI, common features of human-human collaboration derived from the social sciences can be applied to the study of human-computer interaction. We provide insights drawn from interviews with industry personnel working on building human-AI collaboration systems, as well as our collaborations with end-users to build a multimodal AI assistant for task support.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
Authors:
Ramesh Manuvinakurike,
Elizabeth Watkins,
Celal Savur,
Anthony Rhodes,
Sovan Biswas,
Gesem Gudino Mejia,
Richard Beckwith,
Saurav Sahay,
Giuseppe Raffa,
Lama Nachman
Abstract:
In this work we explore utilizing LLMs for data augmentation for manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work to explore the task, data augmentation for the supported tasks and evaluating the performance of the existing LLMs. We observe that that task is com…
▽ More
In this work we explore utilizing LLMs for data augmentation for manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work to explore the task, data augmentation for the supported tasks and evaluating the performance of the existing LLMs. We observe that that task is complex requiring understanding from procedure specification documents, actions and objects sequenced temporally. The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. We compared the performance of several popular open-sourced LLMs by developing a baseline using each LLM and then compared the responses in a reference-free setting using LLM-as-a-judge and compared the ratings with crowd-workers whilst validating the ratings with experts.
△ Less
Submitted 3 December, 2024;
originally announced December 2024.
-
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
Authors:
Shachi H Kumar,
Saurav Sahay,
Sahisnu Mazumder,
Eda Okur,
Ramesh Manuvinakurike,
Nicole Beckage,
Hsuan Su,
Hung-yi Lee,
Lama Nachman
Abstract:
Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evalu…
▽ More
Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM- based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.
△ Less
Submitted 7 August, 2024;
originally announced August 2024.
-
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Authors:
Bertie Vidgen,
Adarsh Agrawal,
Ahmed M. Ahmed,
Victor Akinwande,
Namir Al-Nuaimi,
Najla Alfaraj,
Elie Alhajjar,
Lora Aroyo,
Trupti Bavalatti,
Max Bartolo,
Borhane Blili-Hamelin,
Kurt Bollacker,
Rishi Bomassani,
Marisa Ferrara Boston,
Siméon Campos,
Kal Chakra,
Canyu Chen,
Cody Coleman,
Zacharie Delpierre Coudert,
Leon Derczynski,
Debojyoti Dutta,
Ian Eisenberg,
James Ezick,
Heather Frase,
Brian Fuller
, et al. (75 additional authors not shown)
Abstract:
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…
▽ More
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
△ Less
Submitted 13 May, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Zero-shot Conversational Summarization Evaluations with small Large Language Models
Authors:
Ramesh Manuvinakurike,
Saurav Sahay,
Sangeeta Manepalli,
Lama Nachman
Abstract:
Large Language Models (LLMs) exhibit powerful summarization abilities. However, their capabilities on conversational summarization remains under explored. In this work we evaluate LLMs (approx. 10 billion parameters) on conversational summarization and showcase their performance on various prompts. We show that the summaries generated by models depend on the instructions and the performance of LLM…
▽ More
Large Language Models (LLMs) exhibit powerful summarization abilities. However, their capabilities on conversational summarization remains under explored. In this work we evaluate LLMs (approx. 10 billion parameters) on conversational summarization and showcase their performance on various prompts. We show that the summaries generated by models depend on the instructions and the performance of LLMs vary with different instructions sometimes resulting steep drop in ROUGE scores if prompts are not selected carefully. We also evaluate the models with human evaluations and discuss the limitations of the models on conversational summarization
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Inspecting Spoken Language Understanding from Kids for Basic Math Learning at Home
Authors:
Eda Okur,
Roddy Fuentes Alba,
Saurav Sahay,
Lama Nachman
Abstract:
Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understandin…
▽ More
Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for Kid Space, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. We validate the advantages of a multi-task architecture for NLU and experiment with a diverse set of pretrained language representations for Intent Recognition and Entity Extraction tasks in the math learning domain. To recognize kids' speech in realistic home environments, we investigate several ASR systems, including the commercial Google Cloud and the latest open-source Whisper solutions with varying model sizes. We evaluate the SLU pipeline by testing our best-performing NLU models on noisy ASR output to inspect the challenges of understanding children for math learning in authentic homes.
△ Less
Submitted 1 June, 2023;
originally announced June 2023.
-
Position Matters! Empirical Study of Order Effect in Knowledge-grounded Dialogue
Authors:
Hsuan Su,
Shachi H Kumar,
Sahisnu Mazumder,
Wenda Chen,
Ramesh Manuvinakurike,
Eda Okur,
Saurav Sahay,
Lama Nachman,
Shang-Tse Chen,
Hung-yi Lee
Abstract:
With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models impl…
▽ More
With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models implicitly pay imbalanced attention to the sets during training. In this paper, we first investigate how the order of the knowledge set can influence autoregressive dialogue systems' responses. We conduct experiments on two commonly used dialogue datasets with two types of transformer-based models and find that models view the input knowledge unequally. To this end, we propose a simple and novel technique to alleviate the order effect by modifying the position embeddings of knowledge input in these models. With the proposed position embedding method, the experimental results show that each knowledge statement is uniformly considered to generate responses.
△ Less
Submitted 12 February, 2023;
originally announced February 2023.
-
End-to-End Evaluation of a Spoken Dialogue System for Learning Basic Mathematics
Authors:
Eda Okur,
Saurav Sahay,
Roddy Fuentes Alba,
Lama Nachman
Abstract:
The advances in language-based Artificial Intelligence (AI) technologies applied to build educational applications can present AI for social-good opportunities with a broader positive impact. Across many disciplines, enhancing the quality of mathematics education is crucial in building critical thinking and problem-solving skills at younger ages. Conversational AI systems have started maturing to…
▽ More
The advances in language-based Artificial Intelligence (AI) technologies applied to build educational applications can present AI for social-good opportunities with a broader positive impact. Across many disciplines, enhancing the quality of mathematics education is crucial in building critical thinking and problem-solving skills at younger ages. Conversational AI systems have started maturing to a point where they could play a significant role in helping students learn fundamental math concepts. This work presents a task-oriented Spoken Dialogue System (SDS) built to support play-based learning of basic math concepts for early childhood education. The system has been evaluated via real-world deployments at school while the students are practicing early math concepts with multimodal interactions. We discuss our efforts to improve the SDS pipeline built for math learning, for which we explore utilizing MathBERT representations for potential enhancement to the Natural Language Understanding (NLU) module. We perform an end-to-end evaluation using real-world deployment outputs from the Automatic Speech Recognition (ASR), Intent Recognition, and Dialogue Manager (DM) components to understand how error propagation affects the overall performance in real-world scenarios.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Human in the loop approaches in multi-modal conversational task guidance system development
Authors:
Ramesh Manuvinakurike,
Sovan Biswas,
Giuseppe Raffa,
Richard Beckwith,
Anthony Rhodes,
Meng Shi,
Gesem Gudino Mejia,
Saurav Sahay,
Lama Nachman
Abstract:
Development of task guidance systems for aiding humans in a situated task remains a challenging problem. The role of search (information retrieval) and conversational systems for task guidance has immense potential to help the task performers achieve various goals. However, there are several technical challenges that need to be addressed to deliver such conversational systems, where common supervi…
▽ More
Development of task guidance systems for aiding humans in a situated task remains a challenging problem. The role of search (information retrieval) and conversational systems for task guidance has immense potential to help the task performers achieve various goals. However, there are several technical challenges that need to be addressed to deliver such conversational systems, where common supervised approaches fail to deliver the expected results in terms of overall performance, user experience and adaptation to realistic conditions. In this preliminary work we first highlight some of the challenges involved during the development of such systems. We then provide an overview of existing datasets available and highlight their limitations. We finally develop a model-in-the-loop wizard-of-oz based data collection tool and perform a pilot experiment.
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
NLU for Game-based Learning in Real: Initial Evaluations
Authors:
Eda Okur,
Saurav Sahay,
Lama Nachman
Abstract:
Intelligent systems designed for play-based interactions should be contextually aware of the users and their surroundings. Spoken Dialogue Systems (SDS) are critical for these interactive agents to carry out effective goal-oriented communication with users in real-time. For the real-world (i.e., in-the-wild) deployment of such conversational agents, improving the Natural Language Understanding (NL…
▽ More
Intelligent systems designed for play-based interactions should be contextually aware of the users and their surroundings. Spoken Dialogue Systems (SDS) are critical for these interactive agents to carry out effective goal-oriented communication with users in real-time. For the real-world (i.e., in-the-wild) deployment of such conversational agents, improving the Natural Language Understanding (NLU) module of the goal-oriented SDS pipeline is crucial, especially with limited task-specific datasets. This study explores the potential benefits of a recently proposed transformer-based multi-task NLU architecture, mainly to perform Intent Recognition on small-size domain-specific educational game datasets. The evaluation datasets were collected from children practicing basic math concepts via play-based interactions in game-based learning settings. We investigate the NLU performances on the initial proof-of-concept game datasets versus the real-world deployment datasets and observe anticipated performance drops in-the-wild. We have shown that compared to the more straightforward baseline approaches, Dual Intent and Entity Transformer (DIET) architecture is robust enough to handle real-world data to a large extent for the Intent Recognition task on these domain-specific in-the-wild game datasets.
△ Less
Submitted 26 May, 2022;
originally announced May 2022.
-
Intuitive and Efficient Human-robot Collaboration via Real-time Approximate Bayesian Inference
Authors:
Javier Felip Leon,
David Gonzalez-Aguirre,
Lama Nachman
Abstract:
The combination of collaborative robots and end-to-end AI, promises flexible automation of human tasks in factories and warehouses. However, such promise seems a few breakthroughs away. In the meantime, humans and cobots will collaborate helping each other. For these collaborations to be effective and safe, robots need to model, predict and exploit human's intents for responsive decision making pr…
▽ More
The combination of collaborative robots and end-to-end AI, promises flexible automation of human tasks in factories and warehouses. However, such promise seems a few breakthroughs away. In the meantime, humans and cobots will collaborate helping each other. For these collaborations to be effective and safe, robots need to model, predict and exploit human's intents for responsive decision making processes.
Approximate Bayesian Computation (ABC) is an analysis-by-synthesis approach to perform probabilistic predictions upon uncertain quantities. ABC includes priors conveniently, leverages sampling algorithms for inference and is flexible to benefit from complex models, e.g. via simulators. However, ABC is known to be computationally too intensive to run at interactive frame rates required for effective human-robot collaboration tasks.
In this paper, we formulate human reaching intent prediction as an ABC problem and describe two key performance innovations which allow computations at interactive rates. Our real-world experiments with a collaborative robot set-up, demonstrate the viability of our proposed approach. Experimental evaluations convey the advantages and value of human intent prediction for packing cooperative tasks. Qualitative results show how anticipating human's reaching intent improves human-robot collaboration without compromising safety. Quantitative task fluency metrics confirm the qualitative claims.
△ Less
Submitted 17 May, 2022;
originally announced May 2022.
-
Data Augmentation with Paraphrase Generation and Entity Extraction for Multimodal Dialogue System
Authors:
Eda Okur,
Saurav Sahay,
Lama Nachman
Abstract:
Contextually aware intelligent agents are often required to understand the users and their surroundings in real-time. Our goal is to build Artificial Intelligence (AI) systems that can assist children in their learning process. Within such complex frameworks, Spoken Dialogue Systems (SDS) are crucial building blocks to handle efficient task-oriented communication with children in game-based learni…
▽ More
Contextually aware intelligent agents are often required to understand the users and their surroundings in real-time. Our goal is to build Artificial Intelligence (AI) systems that can assist children in their learning process. Within such complex frameworks, Spoken Dialogue Systems (SDS) are crucial building blocks to handle efficient task-oriented communication with children in game-based learning settings. We are working towards a multimodal dialogue system for younger kids learning basic math concepts. Our focus is on improving the Natural Language Understanding (NLU) module of the task-oriented SDS pipeline with limited datasets. This work explores the potential benefits of data augmentation with paraphrase generation for the NLU models trained on small task-specific datasets. We also investigate the effects of extracting entities for conceivably further data expansion. We have shown that paraphrasing with model-in-the-loop (MITL) strategies using small seed data is a promising approach yielding improved performance results for the Intent Recognition task.
△ Less
Submitted 8 May, 2022;
originally announced May 2022.
-
Controllable Response Generation for Assistive Use-cases
Authors:
Shachi H Kumar,
Hsuan Su,
Ramesh Manuvinakurike,
Saurav Sahay,
Lama Nachman
Abstract:
Conversational agents have become an integral part of the general population for simple task enabling situations. However, these systems are yet to have any social impact on the diverse and minority population, for example, helping people with neurological disorders, for example ALS, and people with speech, language and social communication disorders. Language model technology can play a huge role…
▽ More
Conversational agents have become an integral part of the general population for simple task enabling situations. However, these systems are yet to have any social impact on the diverse and minority population, for example, helping people with neurological disorders, for example ALS, and people with speech, language and social communication disorders. Language model technology can play a huge role to help these users carry out daily communication and social interactions. To enable this population, we build a dialog system that can be controlled by users using cues or keywords. We build models that can suggest relevant cues in the dialog response context which is used to control response generation and can speed up communication. We also introduce a keyword loss to lexically constrain the model output. We show both qualitatively and quantitatively that our models can effectively induce the keyword into the model response without degrading the quality of response. In the context of usage of such systems for people with degenerative disorders, we present human evaluation of our cue or keyword predictor and the controllable dialog system and show that our models perform significantly better than models without control. Our study shows that keyword control on end to end response generation models is powerful and can enable and empower users with degenerative disorders to carry out their day to day communication.
△ Less
Submitted 4 December, 2021;
originally announced December 2021.
-
Semi-supervised Interactive Intent Labeling
Authors:
Saurav Sahay,
Eda Okur,
Nagib Hakim,
Lama Nachman
Abstract:
Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system w…
▽ More
Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system where SDS developers can interactively label and augment training data from unlabeled utterance corpora using advanced clustering and visual labeling methods. We extend the Deep Aligned Clustering work with a better backbone BERT model, explore techniques to select the seed data for labeling, and develop a data balancing method using an oversampling technique that utilizes paraphrasing models. We also look at the effect of data augmentation on the clustering process. Our results show that we can achieve over 10% gain in clustering accuracy on some datasets using the combination of the above techniques. Finally, we extract utterance embeddings from the clustering model and plot the data to interactively bulk label the samples, reducing the time and effort for data labeling of the whole dataset significantly.
△ Less
Submitted 11 May, 2021; v1 submitted 27 April, 2021;
originally announced April 2021.
-
Uncertainty as a Form of Transparency: Measuring, Communicating, and Using Uncertainty
Authors:
Umang Bhatt,
Javier Antorán,
Yunfeng Zhang,
Q. Vera Liao,
Prasanna Sattigeri,
Riccardo Fogliato,
Gabrielle Gauthier Melançon,
Ranganath Krishnan,
Jason Stanley,
Omesh Tickoo,
Lama Nachman,
Rumi Chunara,
Madhulika Srikumar,
Adrian Weller,
Alice Xiang
Abstract:
Algorithmic transparency entails exposing system properties to various stakeholders for purposes that include understanding, improving, and contesting predictions. Until now, most research into algorithmic transparency has predominantly focused on explainability. Explainability attempts to provide reasons for a machine learning model's behavior to stakeholders. However, understanding a model's spe…
▽ More
Algorithmic transparency entails exposing system properties to various stakeholders for purposes that include understanding, improving, and contesting predictions. Until now, most research into algorithmic transparency has predominantly focused on explainability. Explainability attempts to provide reasons for a machine learning model's behavior to stakeholders. However, understanding a model's specific behavior alone might not be enough for stakeholders to gauge whether the model is wrong or lacks sufficient knowledge to solve the task at hand. In this paper, we argue for considering a complementary form of transparency by estimating and communicating the uncertainty associated with model predictions. First, we discuss methods for assessing uncertainty. Then, we characterize how uncertainty can be used to mitigate model unfairness, augment decision-making, and build trustworthy systems. Finally, we outline methods for displaying uncertainty to stakeholders and recommend how to collect information required for incorporating uncertainty into existing ML pipelines. This work constitutes an interdisciplinary review drawn from literature spanning machine learning, visualization/HCI, design, decision-making, and fairness. We aim to encourage researchers and practitioners to measure, communicate, and use uncertainty as a form of transparency.
△ Less
Submitted 4 May, 2021; v1 submitted 15 November, 2020;
originally announced November 2020.
-
Audio-Visual Understanding of Passenger Intents for In-Cabin Conversational Agents
Authors:
Eda Okur,
Shachi H Kumar,
Saurav Sahay,
Lama Nachman
Abstract:
Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is a crucial component for developing contextual and visually grounded conversational agents for AV. Towards this goal, we exp…
▽ More
Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is a crucial component for developing contextual and visually grounded conversational agents for AV. Towards this goal, we explore AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling multimodal passenger-vehicle interactions. In this work, we discuss the benefits of a multimodal understanding of in-cabin utterances by incorporating verbal/language input together with the non-verbal/acoustic and visual clues from inside and outside the vehicle. Our experimental results outperformed text-only baselines as we achieved improved performances for intent detection with a multimodal approach.
△ Less
Submitted 7 July, 2020;
originally announced July 2020.
-
Low Rank Fusion based Transformers for Multimodal Sequences
Authors:
Saurav Sahay,
Eda Okur,
Shachi H Kumar,
Lama Nachman
Abstract:
Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent appro…
▽ More
Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of~\cite{tsai2019MULT} and~\cite{Liu_2018}, we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for the Multimodal Sentiment and Emotion Recognition results on CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have lesser parameters, train faster and perform comparably to many larger fusion-based architectures.
△ Less
Submitted 4 July, 2020;
originally announced July 2020.
-
Optimizing User Interface Layouts via Gradient Descent
Authors:
Peitong Duan,
Casimir Wierzynski,
Lama Nachman
Abstract:
Automating parts of the user interface (UI) design process has been a longstanding challenge. We present an automated technique for optimizing the layouts of mobile UIs. Our method uses gradient descent on a neural network model of task performance with respect to the model's inputs to make layout modifications that result in improved predicted error rates and task completion times. We start by ex…
▽ More
Automating parts of the user interface (UI) design process has been a longstanding challenge. We present an automated technique for optimizing the layouts of mobile UIs. Our method uses gradient descent on a neural network model of task performance with respect to the model's inputs to make layout modifications that result in improved predicted error rates and task completion times. We start by extending prior work on neural network based performance prediction to 2-dimensional mobile UIs with an expanded interaction space. We then apply our method to two UIs, including one that the model had not been trained on, to discover layout alternatives with significantly improved predicted performance. Finally, we confirm these predictions experimentally, showing improvements up to 9.2 percent in the optimized layouts. This demonstrates the algorithm's efficacy in improving the task performance of a layout, and its ability to generalize and improve layouts of new interfaces.
△ Less
Submitted 25 February, 2020;
originally announced February 2020.
-
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
Authors:
Shachi H Kumar,
Eda Okur,
Saurav Sahay,
Jonathan Huang,
Lama Nachman
Abstract:
We are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and Audio Understanding are enabling machines to understand shared semantic concepts and listen to the var…
▽ More
We are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and Audio Understanding are enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal SDS are trained to meaningfully communicate with us in natural language about the real dynamic audio-visual sensory world around us. In this work, we explore the role of `topics' as the context of the conversation along with multimodal attention into such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset released as a part of the DSTC7. We present the analysis of our experiments and show that some of our model variations outperform the baseline system released for AVSD.
△ Less
Submitted 20 December, 2019;
originally announced December 2019.
-
Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog
Authors:
Shachi H Kumar,
Eda Okur,
Saurav Sahay,
Jonathan Huang,
Lama Nachman
Abstract:
With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. Thi…
▽ More
With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities and events surrounding them. In this work, we present three main architectural explorations for the Audio Visual Scene-Aware Dialog (AVSD): 1) investigating `topics' of the dialog as an important contextual feature for the conversation, 2) exploring several multimodal attention mechanisms during response generation, 3) incorporating an end-to-end audio classification ConvNet, AclNet, into our architecture. We discuss detailed analysis of the experimental results and show that our model variations outperform the baseline system presented for the AVSD task.
△ Less
Submitted 20 December, 2019;
originally announced December 2019.
-
Modeling Intent, Dialog Policies and Response Adaptation for Goal-Oriented Interactions
Authors:
Saurav Sahay,
Shachi H Kumar,
Eda Okur,
Haroon Syed,
Lama Nachman
Abstract:
Building a machine learning driven spoken dialog system for goal-oriented interactions involves careful design of intents and data collection along with development of intent recognition models and dialog policy learning algorithms. The models should be robust enough to handle various user distractions during the interaction flow and should steer the user back into an engaging interaction for succ…
▽ More
Building a machine learning driven spoken dialog system for goal-oriented interactions involves careful design of intents and data collection along with development of intent recognition models and dialog policy learning algorithms. The models should be robust enough to handle various user distractions during the interaction flow and should steer the user back into an engaging interaction for successful completion of the interaction. In this work, we have designed a goal-oriented interaction system where children can engage with agents for a series of interactions involving `Meet \& Greet' and `Simon Says' game play. We have explored various feature extractors and models for improved intent recognition and looked at leveraging previous user and system interactions in novel ways with attention models. We have also looked at dialog adaptation methods for entrained response selection. Our bootstrapped models from limited training data perform better than many baseline approaches we have looked at for intent recognition and dialog action prediction.
△ Less
Submitted 20 December, 2019;
originally announced December 2019.
-
Towards Multimodal Understanding of Passenger-Vehicle Interactions in Autonomous Vehicles: Intent/Slot Recognition Utilizing Audio-Visual Data
Authors:
Eda Okur,
Shachi H Kumar,
Saurav Sahay,
Lama Nachman
Abstract:
Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards developing contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal p…
▽ More
Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards developing contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly considering available three modalities (language/text, audio, video) and trigger the appropriate functionality of the AV system. We had collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via realistic scavenger hunt game. In our previous explorations, we experimented with various RNN-based models to detect utterance-level intents (set destination, change route, go faster, go slower, stop, park, pull over, drop off, open door, and others) along with intent keywords and relevant slots (location, position/direction, object, gesture/gaze, time-guidance, person) associated with the action to be performed in our AV scenarios. In this recent work, we propose to discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input (text and speech embeddings) together with the non-verbal/acoustic and visual input from inside and outside the vehicle (i.e., passenger gestures and gaze from in-cabin video stream, referred objects outside of the vehicle from the road view camera stream). Our experimental results outperformed text-only baselines and with multimodality, we achieved improved performances for utterance-level intent detection and slot filling.
△ Less
Submitted 19 September, 2019;
originally announced September 2019.
-
Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances
Authors:
Eda Okur,
Shachi H Kumar,
Saurav Sahay,
Asli Arslan Esme,
Lama Nachman
Abstract:
Understanding passenger intents and extracting relevant slots are important building blocks towards developing contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions t…
▽ More
Understanding passenger intents and extracting relevant slots are important building blocks towards developing contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our current explorations, we focused on AMIE scenarios describing usages around setting or changing the destination and route, updating driving behavior or speed, finishing the trip and other use-cases to support various natural commands. We collected a multi-modal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game activity. After exploring various recent Recurrent Neural Networks (RNN) based techniques, we introduced our own hierarchical joint models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1 scores of 0.91 for utterance-level intent detection and 0.96 for slot filling tasks. In addition, we conducted initial speech-to-text explorations by comparing intent/slot models trained and tested on human transcriptions versus noisy Automatic Speech Recognition (ASR) outputs. Finally, we compared the results with single passenger rides versus the rides with multiple passengers.
△ Less
Submitted 23 April, 2019;
originally announced April 2019.
-
Conversational Intent Understanding for Passengers in Autonomous Vehicles
Authors:
Eda Okur,
Shachi H Kumar,
Saurav Sahay,
Asli Arslan Esme,
Lama Nachman
Abstract:
Understanding passenger intents and extracting relevant slots are important building blocks towards developing a contextual dialogue system responsible for handling certain vehicle-passenger interactions in autonomous vehicles (AV). When the passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropr…
▽ More
Understanding passenger intents and extracting relevant slots are important building blocks towards developing a contextual dialogue system responsible for handling certain vehicle-passenger interactions in autonomous vehicles (AV). When the passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our AMIE scenarios, we describe usages and support various natural commands for interacting with the vehicle. We collected a multimodal in-cabin data-set with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme. We explored various recent Recurrent Neural Networks (RNN) based techniques and built our own hierarchical models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results achieved F1-score of 0.91 on utterance-level intent recognition and 0.96 on slot extraction models.
△ Less
Submitted 13 December, 2018;
originally announced January 2019.
-
Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog
Authors:
Shachi H Kumar,
Eda Okur,
Saurav Sahay,
Juan Jose Alvarado Leanos,
Jonathan Huang,
Lama Nachman
Abstract:
With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home. Going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. As a part of the 7th…
▽ More
With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home. Going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. As a part of the 7th Dialog System Technology Challenges (DSTC7), for Audio Visual Scene-Aware Dialog (AVSD) track, We explore `topics' of the dialog as an important contextual feature into the architecture along with explorations around multimodal Attention. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We present detailed analysis of the experiments and show that some of our model variations outperform the baseline system presented for this task.
△ Less
Submitted 20 December, 2018;
originally announced December 2018.
-
Multimodal Relational Tensor Network for Sentiment and Emotion Classification
Authors:
Saurav Sahay,
Shachi H Kumar,
Rui Xia,
Jonathan Huang,
Lama Nachman
Abstract:
Understanding Affect from video segments has brought researchers from the language, audio and video domains together. Most of the current multimodal research in this area deals with various techniques to fuse the modalities, and mostly treat the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, Relational Tensor…
▽ More
Understanding Affect from video segments has brought researchers from the language, audio and video domains together. Most of the current multimodal research in this area deals with various techniques to fuse the modalities, and mostly treat the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, Relational Tensor Network, where we use the inter-modal interactions within a segment (intra-segment) and also consider the sequence of segments in a video to model the inter-segment inter-modal interactions. We also generate rich representations of text and audio modalities by leveraging richer audio and linguistic context alongwith fusing fine-grained knowledge based polarity scores from text. We present the results of our model on CMU-MOSEI dataset and show that our model outperforms many baselines and state of the art methods for sentiment classification and emotion recognition.
△ Less
Submitted 7 June, 2018;
originally announced June 2018.