-
Eyelid Fold Consistency in Facial Modeling
Authors:
Lohit Petikam,
Charlie Hewitt,
Fatemeh Saleh,
Tadas Baltrušaitis
Abstract:
Eyelid shape is integral to identity and likeness in human facial modeling. Human eyelids are diverse in appearance with varied skin fold and epicanthal fold morphology between individuals. Existing parametric face models express eyelid shape variation to an extent, but do not preserve sufficient likeness across a diverse range of individuals. We propose a new definition of eyelid fold consistency and implement geometric processing techniques to model diverse eyelid shapes in a unified topology. Using this method we reprocess data used to train a parametric face model and demonstrate significant improvements in face-related machine learning tasks.
Submitted 17 October, 2024;
originally announced October 2024.
-
Look Ma, no markers: holistic performance capture without the hassle
Authors:
Charlie Hewitt,
Fatemeh Saleh,
Sadegh Aliakbarian,
Lohit Petikam,
Shideh Rezaeifar,
Louis Florentin,
Zafiirah Hosenie,
Thomas J Cashman,
Julien Valentin,
Darren Cosker,
Tadas Baltrusaitis
Abstract:
We tackle the problem of highly-accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on face, body or hand capture independently, involve complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs as well as supporting varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets.
Submitted 15 October, 2024;
originally announced October 2024.
-
MEV Capture and Decentralization in Execution Tickets
Authors:
Jonah Burian,
Davide Crapis,
Fahad Saleh
Abstract:
We provide an economic model of Execution Tickets and use it to study the ability of the Ethereum protocol to capture MEV from block construction. We demonstrate that Execution Tickets extract all MEV when all buyers are homogeneous, risk neutral and face no capital costs. We also show that MEV capture decreases with risk aversion and capital costs. Moreover, when buyers are heterogeneous, MEV capture can be especially low and a single dominant buyer can extract much of the MEV. This adverse effect can be partially mitigated by the presence of a Proposer Builder Separation (PBS) mechanism, which gives ET buyers access to a market of specialized builders, but in practice centralization vectors still persist. With PBS, ETs are concentrated among those with the highest ex-ante MEV extraction ability and lowest cost of capital. We show how it is possible that large investors that are not builders but have substantial advantage in capital cost can come to dominate the ET market.
Submitted 20 August, 2024;
originally announced August 2024.
-
HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations
Authors:
Sadegh Aliakbarian,
Fatemeh Saleh,
David Collier,
Pashmina Cameron,
Darren Cosker
Abstract:
Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.
Submitted 22 August, 2023;
originally announced August 2023.
-
Time is Money: Strategic Timing Games in Proof-of-Stake Protocols
Authors:
Caspar Schwarz-Schilling,
Fahad Saleh,
Thomas Thiery,
Jennifer Pan,
Nihar Shah,
Barnabé Monnot
Abstract:
We propose a model suggesting that honest-but-rational consensus participants may play timing games, and strategically delay their block proposal to optimize MEV capture, while still ensuring the proposal's timely inclusion in the canonical chain. In this context, ensuring economic fairness among consensus participants is critical to preserving decentralization. We contend that a model grounded in honest-but-rational consensus participation provides a more accurate portrayal of behavior in economically incentivized systems such as blockchain protocols. We empirically investigate timing games on the Ethereum network and demonstrate that while timing games are worth playing, they are not currently being exploited by consensus participants. By quantifying the marginal value of time, we uncover strong evidence pointing towards their future potential, despite the limited exploitation of MEV capture observed at present.
Submitted 15 May, 2023;
originally announced May 2023.
-
Effective Self-supervised Pre-training on Low-compute Networks without Distillation
Authors:
Fuwen Tan,
Fatemeh Saleh,
Brais Martinez
Abstract:
Despite the impressive progress of self-supervised learning (SSL), its applicability to low-compute networks has received limited attention. Reported performance has trailed behind standard supervised pre-training by a large margin, barring self-supervised learning from making an impact on models that are deployed on device. Most prior works attribute this poor performance to the capacity bottleneck of the low-compute networks and opt to bypass the problem through the use of knowledge distillation (KD). In this work, we revisit SSL for efficient neural networks, taking a closer look at the detrimental factors causing the practical limitations, and whether they are intrinsic to the self-supervised low-compute setting. We find that, contrary to accepted knowledge, there is no intrinsic architectural bottleneck; instead, we diagnose that the performance bottleneck is related to the model complexity vs regularization strength trade-off. In particular, we start by empirically observing that the use of local views can have a dramatic impact on the effectiveness of the SSL methods. This hints at view sampling being one of the performance bottlenecks for SSL on low-capacity networks. We hypothesize that the view sampling strategy for large neural networks, which requires matching views in very diverse spatial scales and contexts, is too demanding for low-capacity architectures. We systematize the design of the view sampling mechanism, leading to a new training methodology that consistently improves the performance across different SSL methods (e.g. MoCo-v2, SwAV, DINO), different low-size networks (e.g. MobileNetV2, ResNet18, ResNet34, ViT-Ti), and different tasks (linear probe, object detection, instance segmentation and semi-supervised learning). Our best models establish a new state-of-the-art for SSL methods on low-compute networks despite not using a KD loss term.
Submitted 2 October, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
Authors:
Fatemeh Saleh,
Fuwen Tan,
Adrian Bulat,
Georgios Tzimiropoulos,
Brais Martinez
Abstract:
Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on labeled video data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in fewer epochs and with a smaller batch) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.
Submitted 16 June, 2022;
originally announced June 2022.
-
Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help?
Authors:
Fahimeh Saleh,
Wray Buntine,
Gholamreza Haffari,
Lan Du
Abstract:
Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance the low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distils the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines.
Submitted 14 October, 2021;
originally announced October 2021.
-
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection
Authors:
Mahsa Ehsanpour,
Fatemeh Saleh,
Silvio Savarese,
Ian Reid,
Hamid Rezatofighi
Abstract:
The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognise human actions and their social interactions in an unconstrained real-world environment comprising numerous people, with potentially highly unbalanced and long-tailed distributed action labels from a stream of sensory data captured from a mobile robot platform remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. In this paper, we introduce JRDB-Act, as an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of human daily-life actions in a university campus environment. JRDB-Act has been densely annotated with atomic actions and comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labeled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover, JRDB-Act provides social group annotation, conducive to the task of grouping individuals based on their interactions in the scene to infer their social activities (common activities in each social group). Each annotated label in JRDB-Act is tagged with the annotators' confidence level, which contributes to the development of reliable evaluation strategies. In order to demonstrate how one can effectively utilise such annotations, we develop an end-to-end trainable pipeline to learn and infer these tasks, i.e. individual action and social group detection. The data and the evaluation code are publicly available at https://jrdb.erc.monash.edu/.
Submitted 23 November, 2021; v1 submitted 16 June, 2021;
originally announced June 2021.
-
Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking
Authors:
Fatemeh Saleh,
Sadegh Aliakbarian,
Hamid Rezatofighi,
Mathieu Salzmann,
Stephen Gould
Abstract:
Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge. This is due to the fact that such techniques tend to ignore the long-term motion information. In this paper, we introduce a probabilistic autoregressive motion model to score tracklet proposals by directly measuring their likelihood. This is achieved by training our model to learn the underlying distribution of natural tracklets. As such, our model allows us not only to assign new detections to existing tracklets, but also to inpaint a tracklet when an object has been lost for a long time, e.g., due to occlusion, by sampling tracklets so as to fill the gap caused by misdetections. Our experiments demonstrate the superiority of our approach at tracking objects in challenging sequences; it outperforms the state of the art in most standard MOT metrics on multiple MOT benchmark datasets, including MOT16, MOT17, and MOT20.
Submitted 9 December, 2020; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation
Authors:
Fahimeh Saleh,
Wray Buntine,
Gholamreza Haffari
Abstract:
Scarcity of parallel sentence-pairs poses a significant hurdle for training high-quality Neural Machine Translation (NMT) models in bilingually low-resource scenarios. A standard approach is transfer learning, which involves taking a model trained on a high-resource language-pair and fine-tuning it on the data of the low-resource MT condition of interest. However, it is not generally clear which high-resource language-pair offers the best transfer learning for the target MT setting. Furthermore, different transferred models may have complementary semantic and/or syntactic strengths, hence using only one model may be sub-optimal. In this paper, we tackle this problem using knowledge distillation, where we propose to distill the knowledge of an ensemble of teacher models to a single student model. As the quality of these teacher models varies, we propose an effective adaptive knowledge distillation approach to dynamically adjust the contribution of the teacher models during the distillation process. Experiments on transferring from a collection of six language pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach, achieving up to +0.9 BLEU score improvement compared to strong baselines.
Submitted 12 October, 2020;
originally announced October 2020.
-
Uncertainty Inspired RGB-D Saliency Detection
Authors:
Jing Zhang,
Deng-Ping Fan,
Yuchao Dai,
Saeed Anwar,
Fatemeh Saleh,
Sadegh Aliakbarian,
Nick Barnes
Abstract:
We propose the first stochastic framework to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection models treat this task as a point estimation problem by predicting a single saliency map following a deterministic learning pipeline. We argue that, however, the deterministic solution is relatively ill-posed. Inspired by the saliency data labeling process, we propose a generative architecture to achieve probabilistic RGB-D saliency detection which utilizes a latent variable to model the labeling variations. Our framework includes two main models: 1) a generator model, which maps the input image and latent variable to stochastic saliency prediction, and 2) an inference model, which gradually updates the latent variable by sampling it from the true or approximate posterior distribution. The generator model is an encoder-decoder saliency network. To infer the latent variable, we introduce two different solutions: i) a Conditional Variational Auto-encoder with an extra encoder to approximate the posterior distribution of the latent variable; and ii) an Alternating Back-Propagation technique, which directly samples the latent variable from the true posterior distribution. Qualitative and quantitative results on six challenging RGB-D benchmark datasets show our approach's superior performance in learning the distribution of saliency maps. The source code is publicly available via our project page: https://github.com/JingZhang617/UCNet.
Submitted 7 September, 2020;
originally announced September 2020.
-
Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos
Authors:
Mahsa Ehsanpour,
Alireza Abedin,
Fatemeh Saleh,
Javen Shi,
Ian Reid,
Hamid Rezatofighi
Abstract:
The state-of-the-art solutions for human activity understanding from a video stream formulate the task as a spatio-temporal problem which requires joint localization of all individuals in the scene and classification of their actions or group activity over time. Who is interacting with whom, e.g. not everyone in a queue is interacting with each other, is often not predicted. There are scenarios where people are best split into sub-groups, which we call social groups, and each social group may be engaged in a different social activity. In this paper, we solve the problem of simultaneously grouping people by their social interactions, predicting their individual actions and the social activity of each social group, which we call the social task. Our main contributions are: i) we propose an end-to-end trainable framework for the social task; ii) our proposed method also sets the state-of-the-art results on two widely adopted benchmarks for the traditional group activity recognition task (assuming individuals of the scene form a single group and predicting a single group activity label for the scene); iii) we introduce new annotations on an existing group activity dataset, re-purposing it for the social task.
Submitted 27 July, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose
Authors:
Yizhak Ben-Shabat,
Xin Yu,
Fatemeh Sadat Saleh,
Dylan Campbell,
Cristian Rodriguez-Opazo,
Hongdong Li,
Stephen Gould
Abstract:
The availability of a large labeled dataset is a key requirement for applying deep learning methods to solve various computer vision tasks. In the context of understanding human activities, existing public datasets, while large in size, are often limited to a single RGB camera and provide only per-frame or per-clip action annotations. To enable richer analysis and understanding of human activities, we introduce IKEA ASM -- a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. Additionally, we benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset. The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
Submitted 17 May, 2023; v1 submitted 1 July, 2020;
originally announced July 2020.
-
ArTIST: Autoregressive Trajectory Inpainting and Scoring for Tracking
Authors:
Fatemeh Saleh,
Sadegh Aliakbarian,
Mathieu Salzmann,
Stephen Gould
Abstract:
One of the core components in online multiple object tracking (MOT) frameworks is associating new detections with existing tracklets, typically done via a scoring function. Despite the great advances in MOT, designing a reliable scoring function remains a challenge. In this paper, we introduce a probabilistic autoregressive generative model to score tracklet proposals by directly measuring the likelihood that a tracklet represents natural motion. One key property of our model is its ability to generate multiple likely futures of a tracklet given partial observations. This allows us to not only score tracklets but also effectively maintain existing tracklets when the detector fails to detect some objects even for a long time, e.g., due to occlusion, by sampling trajectories so as to inpaint the gaps caused by misdetection. Our experiments demonstrate the effectiveness of our approach to scoring and inpainting tracklets on several MOT benchmark datasets. We additionally show the generality of our generative model by using it to produce future representations in the challenging task of human motion prediction.
Submitted 16 April, 2020;
originally announced April 2020.
-
UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders
Authors:
Jing Zhang,
Deng-Ping Fan,
Yuchao Dai,
Saeed Anwar,
Fatemeh Sadat Saleh,
Tong Zhang,
Nick Barnes
Abstract:
In this paper, we propose the first framework (UCNet) to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection methods treat the saliency detection task as a point estimation problem, and produce a single saliency map following a deterministic learning pipeline. Inspired by the saliency data labeling process, we propose a probabilistic RGB-D saliency detection network via conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space. With the proposed saliency consensus process, we are able to generate an accurate saliency map based on these multiple predictions. Quantitative and qualitative evaluations on six challenging benchmark datasets against 18 competing algorithms demonstrate the effectiveness of our approach in learning the distribution of saliency maps, leading to a new state-of-the-art in RGB-D saliency detection.
Submitted 13 April, 2020;
originally announced April 2020.
-
Contextually Plausible and Diverse 3D Human Motion Prediction
Authors:
Sadegh Aliakbarian,
Fatemeh Sadat Saleh,
Lars Petersson,
Stephen Gould,
Mathieu Salzmann
Abstract:
We tackle the task of diverse 3D human motion prediction, that is, forecasting multiple plausible future 3D poses given a sequence of observed 3D poses. In this context, a popular approach consists of using a Conditional Variational Autoencoder (CVAE). However, existing approaches that do so either fail to capture the diversity in human motion, or generate diverse but semantically implausible continuations of the observed motion. In this paper, we address both of these problems by developing a new variational framework that accounts for both diversity and context of the generated future motion. To this end, and in contrast to existing approaches, we condition the sampling of the latent variable that acts as source of diversity on the representation of the past observation, thus encouraging it to carry relevant information. Our experiments demonstrate that our approach yields motions not only of higher quality while retaining diversity, but also that preserve the contextual information contained in the observed 3D pose sequence.
Submitted 5 December, 2020; v1 submitted 18 December, 2019;
originally announced December 2019.
-
A Survey on Document-level Neural Machine Translation: Methods and Evaluation
Authors:
Sameen Maruf,
Fahimeh Saleh,
Gholamreza Haffari
Abstract:
Machine translation (MT) is an important task in natural language processing (NLP) as it automates the translation process and reduces the reliance on human translators. With the resurgence of neural networks, the translation quality surpasses that of the translations obtained using statistical techniques for most language-pairs. Up until a few years ago, almost all of the neural translation models translated sentences independently, without incorporating the wider document-context and inter-dependencies among the sentences. The aim of this survey paper is to highlight the major works that have been undertaken in the space of document-level machine translation after the neural revolution, so that researchers can recognise the current state and future directions of this field. We provide an organisation of the literature based on novelties in modelling and architectures as well as training and decoding strategies. In addition, we cover evaluation strategies that have been introduced to account for the improvements in document MT, including automatic metrics and discourse-targeted test sets. We conclude by presenting possible avenues for future exploration in this research field.
Submitted 12 January, 2021; v1 submitted 18 December, 2019;
originally announced December 2019.
-
Naver Labs Europe's Systems for the Document-Level Generation and Translation Task at WNGT 2019
Authors:
Fahimeh Saleh,
Alexandre Bérard,
Ioan Calapodescu,
Laurent Besacier
Abstract:
Recently, neural models have led to significant improvements in both machine translation (MT) and natural language generation (NLG) tasks. However, generation of long descriptive summaries conditioned on structured data remains an open challenge. Likewise, MT that goes beyond sentence-level context is still an open issue (e.g., document-level MT or MT with metadata). To address these challenges, we propose to leverage data from both tasks and do transfer learning between MT, NLG, and MT with source-side metadata (MT+NLG). First, we train document-based MT systems with large amounts of parallel data. Then, we adapt these models to pure NLG and MT+NLG tasks by fine-tuning with smaller amounts of domain-specific data. This end-to-end NLG approach, without data selection and planning, outperforms the previous state of the art on the Rotowire NLG task. We participated in the "Document Generation and Translation" task at WNGT 2019, and ranked first in all tracks.
Submitted 31 October, 2019;
originally announced October 2019.
-
Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention
Authors:
Cristian Rodriguez-Opazo,
Edison Marrese-Taylor,
Fatemeh Sadat Saleh,
Hongdong Li,
Stephen Gould
Abstract:
This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a sentence as the query, the goal is to determine the start and end of the relevant visual moment in the video that corresponds to the query sentence. While previous works have tackled this task by a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, and proposal-free approach that relies on three key components: a dynamic filter to transfer language information to the visual domain, a new loss function to guide our model to attend to the most relevant parts of the video, and soft labels to model annotation uncertainty. We evaluate our method on two benchmark datasets, Charades-STA and ActivityNet-Captions. Experimental results show that our approach outperforms state-of-the-art methods on both datasets.
Submitted 12 March, 2020; v1 submitted 20 August, 2019;
originally announced August 2019.
-
Learning Variations in Human Motion via Mix-and-Match Perturbation
Authors:
Mohammad Sadegh Aliakbarian,
Fatemeh Sadat Saleh,
Mathieu Salzmann,
Lars Petersson,
Stephen Gould,
Amirhossein Habibian
Abstract:
Human motion prediction is a stochastic process: Given an observed sequence of poses, multiple future motions are plausible. Existing approaches to modeling this stochasticity typically combine a random noise vector with information about the previous poses. This combination, however, is done in a deterministic manner, which gives the network the flexibility to learn to ignore the random noise. In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account. We exploit this idea for motion prediction by incorporating it into a recurrent encoder-decoder network with a conditional variational autoencoder block that learns to exploit the perturbations. Our experiments demonstrate that our model yields high-quality pose sequences that are much more diverse than those from state-of-the-art stochastic motion prediction techniques.
Submitted 24 February, 2020; v1 submitted 2 August, 2019;
originally announced August 2019.
-
VIENA2: A Driving Anticipation Dataset
Authors:
Mohammad Sadegh Aliakbarian,
Fatemeh Sadat Saleh,
Mathieu Salzmann,
Basura Fernando,
Lars Petersson,
Lars Andersson
Abstract:
Action anticipation is critical in scenarios where one needs to react before the action is finalized. This is, for instance, the case in automated driving, where a car needs to, e.g., avoid hitting pedestrians and respect traffic lights. While solutions have been proposed to tackle subsets of the driving anticipation tasks, by making use of diverse, task-specific sensors, there is no single dataset or framework that addresses them all in a consistent manner. In this paper, we therefore introduce a new, large-scale dataset, called VIENA2, covering 5 generic driving scenarios, with a total of 25 distinct action classes. It contains more than 15K full HD, 5s long videos acquired in various driving conditions, weather, times of day and environments, complemented with a common and realistic set of sensor measurements. This amounts to more than 2.25M frames, each annotated with an action label, corresponding to 600 samples per action class. We discuss our data acquisition strategy and the statistics of our dataset, and benchmark state-of-the-art action anticipation techniques, including a new multi-modal LSTM architecture with an effective loss function for action anticipation in driving scenarios.
Submitted 29 October, 2018; v1 submitted 21 October, 2018;
originally announced October 2018.
-
Effective Use of Synthetic Data for Urban Scene Semantic Segmentation
Authors:
Fatemeh Sadat Saleh,
Mohammad Sadegh Aliakbarian,
Mathieu Salzmann,
Lars Petersson,
Jose M. Alvarez
Abstract:
Training a deep network to perform semantic segmentation requires large amounts of labeled data. To alleviate the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require having access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their texture in synthetic images is not photo-realistic, their shape looks natural. Our experiments evidence the effectiveness of our approach on Cityscapes and CamVid with models trained on synthetic data only.
Submitted 16 July, 2018;
originally announced July 2018.
-
Bringing Background into the Foreground: Making All Classes Equal in Weakly-supervised Video Semantic Segmentation
Authors:
Fatemeh Sadat Saleh,
Mohammad Sadegh Aliakbarian,
Mathieu Salzmann,
Lars Petersson,
Jose M. Alvarez
Abstract:
Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recent years have seen great progress in weakly-supervised semantic segmentation, whether from a single image or from videos. However, most existing methods are designed to handle a single background class. In practical applications, such as autonomous navigation, it is often crucial to reason about multiple background classes. In this paper, we introduce an approach to doing so by making use of classifier heatmaps. We then develop a two-stream deep architecture that jointly leverages appearance and motion, and design a loss based on our heatmaps to train it. Our experiments demonstrate the benefits of our classifier heatmaps and of our two-stream architecture on challenging urban scene datasets and on the YouTube-Objects benchmark, where we obtain state-of-the-art results.
Submitted 15 August, 2017;
originally announced August 2017.
-
Incorporating Network Built-in Priors in Weakly-supervised Semantic Segmentation
Authors:
Fatemeh Sadat Saleh,
Mohammad Sadegh Aliakbarian,
Mathieu Salzmann,
Lars Petersson,
Jose M. Alvarez,
Stephen Gould
Abstract:
Pixel-level annotations are expensive and time consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately these priors either require pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract accurate masks from networks pre-trained for the task of object recognition, thus forgoing external objectness modules. We first show how foreground/background masks can be obtained from the activations of higher-level convolutional layers of a network. We then show how to obtain multi-class masks by the fusion of foreground/background ones with information extracted from a weakly-supervised localization network. Our experiments evidence that exploiting these masks in conjunction with a weakly-supervised training loss yields state-of-the-art tag-based weakly-supervised semantic segmentation results.
Submitted 5 June, 2017;
originally announced June 2017.
-
Encouraging LSTMs to Anticipate Actions Very Early
Authors:
Mohammad Sadegh Aliakbarian,
Fatemeh Sadat Saleh,
Mathieu Salzmann,
Basura Fernando,
Lars Petersson,
Lars Andersson
Abstract:
In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos. It is therefore key to the success of computer vision applications that must react as early as possible, such as autonomous navigation. In this paper, we propose a new action anticipation method that achieves high prediction accuracy even in the presence of a very small percentage of a video sequence. To this end, we develop a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduce a novel loss function that encourages the model to predict the correct class as early as possible. Our experiments on standard benchmark datasets evidence the benefits of our approach. We outperform the state-of-the-art action anticipation methods for early prediction by a relative increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on UCF-101.
Submitted 13 August, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Deep Action- and Context-Aware Sequence Learning for Activity Recognition and Anticipation
Authors:
Mohammad Sadegh Aliakbarian,
Fatemehsadat Saleh,
Basura Fernando,
Mathieu Salzmann,
Lars Petersson,
Lars Andersson
Abstract:
Action recognition and anticipation are key to the success of many computer vision applications. Existing methods can roughly be grouped into those that extract global, context-aware representations of the entire image or sequence, and those that focus on the regions where the action occurs. While the former may suffer from the fact that context is not always reliable, the latter completely ignore this source of information, which can nonetheless be helpful in many situations. In this paper, we aim to make the best of both worlds by developing an approach that leverages both context-aware and action-aware features. At the core of our method lies a novel multi-stage recurrent architecture that allows us to effectively combine these two sources of information throughout a video. This architecture first exploits the global, context-aware features, and merges the resulting representation with the localized, action-aware ones. Our experiments on standard datasets evidence the benefits of our approach over methods that use each information type separately. We outperform the state-of-the-art methods that, like ours, rely only on RGB frames as input for both action recognition and anticipation.
Submitted 17 November, 2016; v1 submitted 16 November, 2016;
originally announced November 2016.
-
Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation
Authors:
Fatemehsadat Saleh,
Mohammad Sadegh Ali Akbarian,
Mathieu Salzmann,
Lars Petersson,
Stephen Gould,
Jose M. Alvarez
Abstract:
Pixel-level annotations are expensive and time consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately these priors either require training pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract markedly more accurate masks from the pre-trained network itself, forgoing external objectness modules. This is accomplished using the activations of the higher-level convolutional layers, smoothed by a dense CRF. We demonstrate that our method, based on these masks and a weakly-supervised loss, outperforms the state-of-the-art tag-based weakly-supervised semantic segmentation techniques. Furthermore, we introduce a new form of inexpensive weak supervision yielding an additional accuracy boost.
Submitted 1 September, 2016;
originally announced September 2016.
-
Evaluating Open Access Paper Repository In Higher Education For Asean Region
Authors:
Reza Chandra,
Arif Purwo Nugroho,
Fikri Saleh
Abstract:
A paper repository in higher education is a collection of scientific articles created by the academic community. This study examined 80 universities in the Webometrics ranking of repositories in the Southeast Asia region. The tools used in this research are Google for the number of web pages, Google Scholar for the number of documents in each repository, and Ahrefs for referring pages, backlinks and referring domains. The results show that Eprints is the most widely used tool in higher education, used by 37 institutions (46.25%). Institut Teknologi Sepuluh November got the highest score for number of web pages in Google (2,010,000), and Bogor Agricultural University Scientific Repository got the highest score for number of documents (44,300). University of Sumatera Utara Repository got the highest score for referring pages (82,588) and backlinks (86,421). Universiti Teknologi Malaysia Institutional Repository got the highest score for referring domains (532).
Submitted 13 February, 2015;
originally announced February 2015.