-
Object Learning and Robust 3D Reconstruction
Authors:
Sara Sabour
Abstract:
In this thesis we discuss architectural designs and training methods for a neural network to have the ability of dissecting an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing between foreground objects of interest and background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last pa…
▽ More
In this thesis we discuss architectural designs and training methods for a neural network to have the ability of dissecting an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing between foreground objects of interest and background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last part of this thesis focuses on 3D applications where the goal is detecting and removal of the object of interest from the input images. In these tasks, we leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects. Our transient object masks are then used for designing robust optimization kernels to improve 3D modelling in a casual capture setup. One of our goals in this thesis is to show the merits of unsupervised object based approaches in computer vision. Furthermore, we suggest possible directions for defining objects of interest or foreground objects without requiring supervision. Our hope is to motivate and excite the community into further exploring explicit object representations in image understanding tasks.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Human Decision-making is Susceptible to AI-driven Manipulation
Authors:
Sahand Sabour,
June M. Liu,
Siyang Liu,
Chris Z. Yao,
Shiyao Cui,
Xuanming Zhang,
Wen Zhang,
Yaru Cao,
Advait Bhat,
Jian Guan,
Wei Wu,
Rada Mihalcea,
Hongning Wang,
Tim Althoff,
Tatia M. C. Lee,
Minlie Huang
Abstract:
Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233…
▽ More
Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
△ Less
Submitted 24 February, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Enhanced Large Language Models for Effective Screening of Depression and Anxiety
Authors:
June M. Liu,
Mengxia Gao,
Sahand Sabour,
Zhuang Chen,
Minlie Huang,
Tatia M. C. Lee
Abstract:
Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan,…
▽ More
Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.
△ Less
Submitted 25 January, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
RoMo: Robust Motion Segmentation Improves Structure from Motion
Authors:
Lily Goli,
Sara Sabour,
Mark Matthews,
Marcus Brubaker,
Dmitry Lagun,
Alec Jacobson,
David J. Fleet,
Saurabh Saxena,
Andrea Tagliasacchi
Abstract:
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM cam…
▽ More
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting
Authors:
Sara Sabour,
Lily Goli,
George Kopanas,
Mark Matthews,
Dmitry Lagun,
Leonidas Guibas,
Alec Jacobson,
David J. Fleet,
Andrea Tagliasacchi
Abstract:
3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications.However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world c…
▽ More
3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications.However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world captures problematic. We present SpotLessSplats, an approach that leverages pre-trained and general-purpose features coupled with robust optimization to effectively ignore transient distractors. Our method achieves state-of-the-art reconstruction quality both visually and quantitatively, on casual captures. Additional results available at: https://spotlesssplats.github.io
△ Less
Submitted 29 July, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
EmoBench: Evaluating the Emotional Intelligence of Large Language Models
Authors:
Sahand Sabour,
Siyang Liu,
Zheyuan Zhang,
June M. Liu,
Jinfeng Zhou,
Alvionna S. Sunaryo,
Juanzi Li,
Tatia M. C. Lee,
Rada Mihalcea,
Minlie Huang
Abstract:
Recent advances in Large Language Models (LLMs) have highlighted the need for robust, comprehensive, and challenging benchmarks. Yet, research on evaluating their Emotional Intelligence (EI) is considerably limited. Existing benchmarks have two major shortcomings: first, they mainly focus on emotion recognition, neglecting essential EI capabilities such as emotion regulation and thought facilitati…
▽ More
Recent advances in Large Language Models (LLMs) have highlighted the need for robust, comprehensive, and challenging benchmarks. Yet, research on evaluating their Emotional Intelligence (EI) is considerably limited. Existing benchmarks have two major shortcomings: first, they mainly focus on emotion recognition, neglecting essential EI capabilities such as emotion regulation and thought facilitation through emotion understanding; second, they are primarily constructed from existing datasets, which include frequent patterns, explicit information, and annotation errors, leading to unreliable evaluation. We propose EmoBench, a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine EI, including Emotional Understanding and Emotional Application. EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding. Our findings reveal a considerable gap between the EI of existing LLMs and the average human, highlighting a promising direction for future research. Our code and data are publicly available at https://github.com/Sahandfer/EmoBench.
△ Less
Submitted 17 July, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models
Authors:
Jinfeng Zhou,
Zhuang Chen,
Dazhen Wan,
Bosi Wen,
Yi Song,
Jifan Yu,
Yongkang Huang,
Libiao Peng,
Jiaming Yang,
Xiyao Xiao,
Sahand Sabour,
Xiaohan Zhang,
Wenjing Hou,
Yijia Zhang,
Yuxiao Dong,
Jie Tang,
Minlie Huang
Abstract:
In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can custom…
▽ More
In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream close-source large langauge models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations. We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research development in the direction of character-based dialogue generation.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Authors:
Siyang Liu,
Naihao Deng,
Sahand Sabour,
Yilin Jia,
Minlie Huang,
Rada Mihalcea
Abstract:
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building…
▽ More
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model's tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.
△ Less
Submitted 13 November, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
COKE: A Cognitive Knowledge Graph for Machine Theory of Mind
Authors:
Jincenzi Wu,
Zhuang Chen,
Jiawen Deng,
Sahand Sabour,
Helen Meng,
Minlie Huang
Abstract:
Theory of mind (ToM) refers to humans' ability to understand and infer the desires, beliefs, and intentions of others. The acquisition of ToM plays a key role in humans' social cognition and interpersonal relations. Though indispensable for social intelligence, ToM is still lacking for modern AI and NLP systems since they cannot access the human mental state and cognitive process beneath the train…
▽ More
Theory of mind (ToM) refers to humans' ability to understand and infer the desires, beliefs, and intentions of others. The acquisition of ToM plays a key role in humans' social cognition and interpersonal relations. Though indispensable for social intelligence, ToM is still lacking for modern AI and NLP systems since they cannot access the human mental state and cognitive process beneath the training corpus. To empower AI systems with the ToM ability and narrow the gap between them and humans, in this paper, we propose COKE: the first cognitive knowledge graph for machine theory of mind. Specifically, COKE formalizes ToM as a collection of 45k+ manually verified cognitive chains that characterize human mental activities and subsequent behavioral/affective responses when facing specific social circumstances. In addition, we further generalize COKE using LLMs and build a powerful generation model COLM tailored for cognitive reasoning. Experimental results in both automatic and human evaluation demonstrate the high quality of COKE, the superior ToM ability of COLM, and its potential to significantly enhance social applications.
△ Less
Submitted 18 May, 2024; v1 submitted 9 May, 2023;
originally announced May 2023.
-
RobustNeRF: Ignoring Distractors with Robust Losses
Authors:
Sara Sabour,
Suhani Vora,
Daniel Duckworth,
Ivan Krasin,
David J. Fleet,
Andrea Tagliasacchi
Abstract:
Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling dist…
▽ More
Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io.
△ Less
Submitted 26 July, 2024; v1 submitted 1 February, 2023;
originally announced February 2023.
-
PAL: Persona-Augmented Emotional Support Conversation Generation
Authors:
Jiale Cheng,
Sahand Sabour,
Hao Sun,
Zhuang Chen,
Minlie Huang
Abstract:
Due to the lack of human resources for mental health support, there is an increasing demand for employing conversational agents for support. Recent work has demonstrated the effectiveness of dialogue models in providing emotional support. As previous studies have demonstrated that seekers' persona is an important factor for effective support, we investigate whether there are benefits to modeling s…
▽ More
Due to the lack of human resources for mental health support, there is an increasing demand for employing conversational agents for support. Recent work has demonstrated the effectiveness of dialogue models in providing emotional support. As previous studies have demonstrated that seekers' persona is an important factor for effective support, we investigate whether there are benefits to modeling such information in dialogue models for support. In this paper, our empirical analysis verifies that persona has an important impact on emotional support. Therefore, we propose a framework for dynamically inferring and modeling seekers' persona. We first train a model for inferring the seeker's persona from the conversation history. Accordingly, we propose PAL, a model that leverages persona information and, in conjunction with our strategy-based controllable generation method, provides personalized emotional support. Automatic and manual evaluations demonstrate that PAL achieves state-of-the-art results, outperforming the baselines on the studied benchmark. Our code and data are publicly available at https://github.com/chengjl19/PAL.
△ Less
Submitted 29 May, 2023; v1 submitted 18 December, 2022;
originally announced December 2022.
-
Testing GLOM's ability to infer wholes from ambiguous parts
Authors:
Laura Culp,
Sara Sabour,
Geoffrey E. Hinton
Abstract:
The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly am…
▽ More
The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.
△ Less
Submitted 29 November, 2022;
originally announced November 2022.
-
nerf2nerf: Pairwise Registration of Neural Radiance Fields
Authors:
Lily Goli,
Daniel Rebain,
Sara Sabour,
Animesh Garg,
Andrea Tagliasacchi
Abstract:
We introduce a technique for pairwise registration of neural fields that extends classical optimization-based local registration (i.e. ICP) to operate on Neural Radiance Fields (NeRF) -- neural 3D scene representations trained from collections of calibrated images. NeRF does not decompose illumination and color, so to make registration invariant to illumination, we introduce the concept of a ''sur…
▽ More
We introduce a technique for pairwise registration of neural fields that extends classical optimization-based local registration (i.e. ICP) to operate on Neural Radiance Fields (NeRF) -- neural 3D scene representations trained from collections of calibrated images. NeRF does not decompose illumination and color, so to make registration invariant to illumination, we introduce the concept of a ''surface field'' -- a field distilled from a pre-trained NeRF model that measures the likelihood of a point being on the surface of an object. We then cast nerf2nerf registration as a robust optimization that iteratively seeks a rigid transformation that aligns the surface fields of the two scenes. We evaluate the effectiveness of our technique by introducing a dataset of pre-trained NeRF scenes -- our synthetic scenes enable quantitative evaluations and comparisons to classical registration techniques, while our real scenes demonstrate the validity of our technique in real-world scenarios. Additional results available at: https://nerf2nerf.github.io
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
Chatbots for Mental Health Support: Exploring the Impact of Emohaa on Reducing Mental Distress in China
Authors:
Sahand Sabour,
Wen Zhang,
Xiyao Xiao,
Yuwei Zhang,
Yinhe Zheng,
Jiaxin Wen,
Jialu Zhao,
Minlie Huang
Abstract:
The growing demand for mental health support has highlighted the importance of conversational agents as human supporters worldwide and in China. These agents could increase availability and reduce the relative costs of mental health support. The provided support can be divided into two main types: cognitive and emotional support. Existing work on this topic mainly focuses on constructing agents th…
▽ More
The growing demand for mental health support has highlighted the importance of conversational agents as human supporters worldwide and in China. These agents could increase availability and reduce the relative costs of mental health support. The provided support can be divided into two main types: cognitive and emotional support. Existing work on this topic mainly focuses on constructing agents that adopt Cognitive Behavioral Therapy (CBT) principles. Such agents operate based on pre-defined templates and exercises to provide cognitive support. However, research on emotional support using such agents is limited. In addition, most of the constructed agents operate in English, highlighting the importance of conducting such studies in China. In this study, we analyze the effectiveness of Emohaa in reducing symptoms of mental distress. Emohaa is a conversational agent that provides cognitive support through CBT-based exercises and guided conversations. It also emotionally supports users by enabling them to vent their desired emotional problems. The study included 134 participants, split into three groups: Emohaa (CBT-based), Emohaa (Full), and control. Experimental results demonstrated that compared to the control group, participants who used Emohaa experienced considerably more significant improvements in symptoms of mental distress. We also found that adding the emotional support agent had a complementary effect on such improvements, mainly depression and insomnia. Based on the obtained results and participants' satisfaction with the platform, we concluded that Emohaa is a practical and effective tool for reducing mental distress.
△ Less
Submitted 21 September, 2022;
originally announced September 2022.
-
Kubric: A scalable dataset generator
Authors:
Klaus Greff,
Francois Belletti,
Lucas Beyer,
Carl Doersch,
Yilun Du,
Daniel Duckworth,
David J. Fleet,
Dan Gnanapragasam,
Florian Golemo,
Charles Herrmann,
Thomas Kipf,
Abhijit Kundu,
Dmitry Lagun,
Issam Laradji,
Hsueh-Ti,
Liu,
Henning Meyer,
Yishu Miao,
Derek Nowrouzezahrai,
Cengiz Oztireli,
Etienne Pot,
Noha Radwan,
Daniel Rebain,
Sara Sabour,
Mehdi S. M. Sajjadi
, et al. (10 additional authors not shown)
Abstract:
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real data at scale is difficult, expensive, and frequently raises additional privacy, fairness and legal concerns. Synthetic data is a powerful tool with the potential…
▽ More
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real data at scale is difficult, expensive, and frequently raises additional privacy, fairness and legal concerns. Synthetic data is a powerful tool with the potential to address these shortcomings: 1) it is cheap 2) supports rich ground-truth annotations 3) offers full control over data and 4) can circumvent or mitigate problems regarding bias, privacy and licensing. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts. To address these problems we introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines, and generating TBs of data. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation. We release Kubric, the used assets, all of the generation code, as well as the rendered datasets for reuse and modification.
△ Less
Submitted 7 March, 2022;
originally announced March 2022.
-
Rethinking and Refining the Distinct Metric
Authors:
Siyang Liu,
Sahand Sabour,
Yinhe Zheng,
Pei Ke,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Distinct-$n$ score\cite{Li2016} is a widely used automatic metric for evaluating diversity in language generation tasks. However, we observed that the original approach for calculating distinct scores has evident biases that tend to assign higher penalties to longer sequences. We refine the calculation of distinct scores by scaling the number of distinct tokens based on their expectations. We prov…
▽ More
Distinct-$n$ score\cite{Li2016} is a widely used automatic metric for evaluating diversity in language generation tasks. However, we observed that the original approach for calculating distinct scores has evident biases that tend to assign higher penalties to longer sequences. We refine the calculation of distinct scores by scaling the number of distinct tokens based on their expectations. We provide both empirical and theoretical evidence to show that our method effectively removes the biases existing in the original distinct score. Our experiments show that our proposed metric, \textit{Expectation-Adjusted Distinct (EAD)}, correlates better with human judgment in evaluating response diversity. To foster future research, we provide an example implementation at \url{https://github.com/lsy641/Expectation-Adjusted-Distinct}.
△ Less
Submitted 3 April, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation
Authors:
Chujie Zheng,
Sahand Sabour,
Jiaxin Wen,
Zheng Zhang,
Minlie Huang
Abstract:
Crowdsourced dialogue corpora are usually limited in scale and topic coverage due to the expensive cost of data curation. This would hinder the generalization of downstream dialogue models to open-domain topics. In this work, we leverage large language models for dialogue augmentation in the task of emotional support conversation (ESC). By treating dialogue augmentation as a dialogue completion ta…
▽ More
Crowdsourced dialogue corpora are usually limited in scale and topic coverage due to the expensive cost of data curation. This would hinder the generalization of downstream dialogue models to open-domain topics. In this work, we leverage large language models for dialogue augmentation in the task of emotional support conversation (ESC). By treating dialogue augmentation as a dialogue completion task, we prompt a fine-tuned language model to complete full dialogues from available dialogue posts of various topics, which are then postprocessed based on heuristics. Applying this approach, we construct AugESC, an augmented dataset for the ESC task, which largely extends the scale and topic coverage of the crowdsourced ESConv corpus. Through comprehensive human evaluation, we demonstrate that our approach is superior to strong baselines of dialogue augmentation and that AugESC has comparable dialogue quality to the crowdsourced corpus. We also conduct human interactive evaluation and prove that post-training on AugESC improves downstream dialogue models' generalization ability to open-domain topics. These results suggest the utility of AugESC and highlight the potential of large language models in improving data-scarce dialogue generation tasks.
△ Less
Submitted 18 May, 2023; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Conditional Object-Centric Learning from Video
Authors:
Thomas Kipf,
Gamaleldin F. Elsayed,
Aravindh Mahendran,
Austin Stone,
Sara Sabour,
Georg Heigold,
Rico Jonschkowski,
Alexey Dosovitskiy,
Klaus Greff
Abstract:
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for…
▽ More
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
△ Less
Submitted 15 March, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
CEM: Commonsense-aware Empathetic Response Generation
Authors:
Sahand Sabour,
Chujie Zheng,
Minlie Huang
Abstract:
A key trait of daily conversations between individuals is the ability to express empathy towards others, and exploring ways to implement empathy is a crucial step towards human-like dialogue systems. Previous approaches on this topic mainly focus on detecting and utilizing the user's emotion for generating empathetic responses. However, since empathy includes both aspects of affection and cognitio…
▽ More
A key trait of daily conversations between individuals is the ability to express empathy towards others, and exploring ways to implement empathy is a crucial step towards human-like dialogue systems. Previous approaches on this topic mainly focus on detecting and utilizing the user's emotion for generating empathetic responses. However, since empathy includes both aspects of affection and cognition, we argue that in addition to identifying the user's emotion, cognitive understanding of the user's situation should also be considered. To this end, we propose a novel approach for empathetic response generation, which leverages commonsense to draw more information about the user's situation and uses this additional information to further enhance the empathy expression in generated responses. We evaluate our approach on EmpatheticDialogues, which is a widely-used benchmark dataset for empathetic response generation. Empirical results demonstrate that our approach outperforms the baseline models in both automatic and human evaluations and can generate more informative and empathetic responses.
△ Less
Submitted 6 December, 2021; v1 submitted 13 September, 2021;
originally announced September 2021.
-
DeepFlow: Abnormal Traffic Flow Detection Using Siamese Networks
Authors:
Sepehr Sabour,
Sanjeev Rao,
Majid Ghaderi
Abstract:
Nowadays, many cities are equipped with surveillance systems and traffic control centers to monitor vehicular traffic for road safety and efficiency. The monitoring process is mostly done manually which is inefficient and expensive. In recent years, several data-driven solutions have been proposed in the literature to automatically analyze traffic flow data using machine learning techniques. Howev…
▽ More
Nowadays, many cities are equipped with surveillance systems and traffic control centers to monitor vehicular traffic for road safety and efficiency. The monitoring process is mostly done manually which is inefficient and expensive. In recent years, several data-driven solutions have been proposed in the literature to automatically analyze traffic flow data using machine learning techniques. However, existing solutions require large and comprehensive datasets for training which are not readily available, thus limiting their application. In this paper, we develop a traffic anomaly detection system, referred to as DeepFlow, based on Siamese neural networks, which are suitable in scenarios where only small datasets are available for training. Our model can detect abnormal traffic flows by analyzing the trajectory data collected from the vehicles in a fleet. To evaluate DeepFlow, we use realistic vehicular traffic simulations in SUMO. Our results show that DeepFlow detects abnormal traffic patterns with an F1 score of 78%, while outperforming other existing approaches including: Dynamic Time Warping (DTW), Global Alignment Kernels (GAK), and iForest.
△ Less
Submitted 26 August, 2021;
originally announced August 2021.
-
Towards Emotional Support Dialog Systems
Authors:
Siyang Liu,
Chujie Zheng,
Orianna Demasi,
Sahand Sabour,
Yu Li,
Zhou Yu,
Yong Jiang,
Minlie Huang
Abstract:
Emotional support is a crucial ability for many conversation scenarios, including social interactions, mental health support, and customer service chats. Following reasonable procedures and using various support skills can help to effectively provide support. However, due to the lack of a well-designed task and corpora of effective emotional support conversations, research on building emotional su…
▽ More
Emotional support is a crucial ability for many conversation scenarios, including social interactions, mental health support, and customer service chats. Following reasonable procedures and using various support skills can help to effectively provide support. However, due to the lack of a well-designed task and corpora of effective emotional support conversations, research on building emotional support into dialog systems remains untouched. In this paper, we define the Emotional Support Conversation (ESC) task and propose an ESC Framework, which is grounded on the Helping Skills Theory. We construct an Emotion Support Conversation dataset (ESConv) with rich annotation (especially support strategy) in a help-seeker and supporter mode. To ensure a corpus of high-quality conversations that provide examples of effective emotional support, we take extensive effort to design training tutorials for supporters and several mechanisms for quality control during data collection. Finally, we evaluate state-of-the-art dialog models with respect to the ability to provide emotional support. Our results show the importance of support strategies in providing effective emotional support and the utility of ESConv in training more emotional support systems.
△ Less
Submitted 2 June, 2021;
originally announced June 2021.
-
Canonical Capsules: Self-Supervised Capsules in Canonical Pose
Authors:
Weiwei Sun,
Andrea Tagliasacchi,
Boyang Deng,
Sara Sabour,
Soroosh Yazdani,
Geoffrey Hinton,
Kwang Moo Yi
Abstract:
We propose a self-supervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equ…
▽ More
We propose a self-supervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equivariance properties. This not only enables the training of a semantically consistent decomposition, but also allows us to learn a canonicalization operation that enables object-centric reasoning. To train our neural network we require neither classification labels nor manually-aligned training datasets. Yet, by learning an object-centric representation in a self-supervised manner, our method outperforms the state-of-the-art on 3D point cloud reconstruction, canonicalization, and unsupervised classification.
△ Less
Submitted 24 November, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Unsupervised part representation by Flow Capsules
Authors:
Sara Sabour,
Andrea Tagliasacchi,
Soroosh Yazdani,
Geoffrey E. Hinton,
David J. Fleet
Abstract:
Capsule networks aim to parse images into a hierarchy of objects, parts and relations. While promising, they remain limited by an inability to learn effective low level part descriptions. To address this issue we propose a way to learn primary capsule encoders that detect atomic parts from a single image. During training we exploit motion as a powerful perceptual cue for part definition, with an e…
▽ More
Capsule networks aim to parse images into a hierarchy of objects, parts and relations. While promising, they remain limited by an inability to learn effective low level part descriptions. To address this issue we propose a way to learn primary capsule encoders that detect atomic parts from a single image. During training we exploit motion as a powerful perceptual cue for part definition, with an expressive decoder for part generation within a layered image model with occlusion. Experiments demonstrate robust part discovery in the presence of multiple objects, cluttered backgrounds, and occlusion. The part decoder infers the underlying shape masks, effectively filling in occluded regions of the detected shapes. We evaluate FlowCapsules on unsupervised part segmentation and unsupervised image classification.
△ Less
Submitted 19 February, 2021; v1 submitted 27 November, 2020;
originally announced November 2020.
-
Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions
Authors:
Yao Qin,
Nicholas Frosst,
Sara Sabour,
Colin Raffel,
Garrison Cottrell,
Geoffrey Hinton
Abstract:
Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and…
▽ More
Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with much smaller success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples.
△ Less
Submitted 18 February, 2020; v1 submitted 5 July, 2019;
originally announced July 2019.
-
Stacked Capsule Autoencoders
Authors:
Adam R. Kosiorek,
Sara Sabour,
Yee Whye Teh,
Geoffrey E. Hinton
Abstract:
Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of par…
▽ More
Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at https://github.com/google-research/google-research/tree/master/stacked_capsule_autoencoders
△ Less
Submitted 2 December, 2019; v1 submitted 16 June, 2019;
originally announced June 2019.
-
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Authors:
Jonathan Shen,
Patrick Nguyen,
Yonghui Wu,
Zhifeng Chen,
Mia X. Chen,
Ye Jia,
Anjuli Kannan,
Tara Sainath,
Yuan Cao,
Chung-Cheng Chiu,
Yanzhang He,
Jan Chorowski,
Smit Hinsu,
Stella Laurenzo,
James Qin,
Orhan Firat,
Wolfgang Macherey,
Suyog Gupta,
Ankur Bapna,
Shuyuan Zhang,
Ruoming Pang,
Ron J. Weiss,
Rohit Prabhavalkar,
Qiao Liang,
Benoit Jacob
, et al. (66 additional authors not shown)
Abstract:
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…
▽ More
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
△ Less
Submitted 21 February, 2019;
originally announced February 2019.
-
DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules
Authors:
Nicholas Frosst,
Sara Sabour,
Geoffrey Hinton
Abstract:
We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reco…
▽ More
We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $l2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class.
△ Less
Submitted 16 November, 2018;
originally announced November 2018.
-
Optimal Completion Distillation for Sequence Learning
Authors:
Sara Sabour,
William Chan,
Mohammad Norouzi
Abstract:
We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total e…
▽ More
We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution that puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.5\%$ WER respectively.
△ Less
Submitted 14 January, 2019; v1 submitted 2 October, 2018;
originally announced October 2018.
-
Dynamic Routing Between Capsules
Authors:
Sara Sabour,
Nicholas Frosst,
Geoffrey E Hinton
Abstract:
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the…
▽ More
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discrimininatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
△ Less
Submitted 7 November, 2017; v1 submitted 26 October, 2017;
originally announced October 2017.
-
Adversarial Manipulation of Deep Representations
Authors:
Sara Sabour,
Yanshuai Cao,
Fartash Faghri,
David J. Fleet
Abstract:
We show that the representation of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels, while we concentrate on the internal layers of DNN representations. In t…
▽ More
We show that the representation of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels, while we concentrate on the internal layers of DNN representations. In this way our new class of adversarial images differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, one from a different class, bearing little if any apparent similarity to the input; they appear generic and consistent with the space of natural images. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves.
△ Less
Submitted 4 March, 2016; v1 submitted 16 November, 2015;
originally announced November 2015.