Search | arXiv e-print repository

Deep learning in medical image registration: introduction and survey

Authors: Ahmad Hammoudeh, Stéphane Dupont

Abstract: Image registration (IR) is a process that deforms images to align them with respect to a reference space, making it easier for medical practitioners to examine various medical images in a standardized reference frame, such as having the same rotation and scale. This document introduces image registration using a simple numeric example. It provides a definition of image registration along with a space-oriented symbolic representation. This review covers various aspects of image transformations, including affine, deformable, invertible, and bidirectional transformations, as well as medical image registration algorithms such as VoxelMorph, Demons, SyN, Iterative Closest Point, and SynthMorph. It also explores atlas-based registration and multistage image registration techniques, including coarse-fine and pyramid approaches. Furthermore, this survey paper discusses medical image registration taxonomies, datasets, and evaluation measures such as correlation-based metrics, segmentation-based metrics, processing time, and model size. It also explores applications in image-guided surgery, motion tracking, and tumor diagnosis. Finally, the document addresses future research directions, including the further development of transformers.

Submitted 10 January, 2024; v1 submitted 1 September, 2023; originally announced September 2023.
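The affine transformations mentioned in the abstract above can be illustrated with a small numeric sketch. This is not the paper's own example: the function names `make_affine` and `apply_affine`, and the chosen angle and scale, are assumptions made purely for illustration. It shows how a rotation, a uniform scale, and a translation combine into one matrix that maps a point from a moving image into a reference frame.

```python
import math


def make_affine(angle_deg, scale, tx, ty):
    """Build a 2x3 affine matrix: rotate by angle_deg, scale uniformly, then translate.

    Hypothetical helper for illustration; not taken from the surveyed paper.
    """
    a = math.radians(angle_deg)
    c = math.cos(a) * scale
    s = math.sin(a) * scale
    return [[c, -s, tx],
            [s, c, ty]]


def apply_affine(matrix, point):
    """Map a 2D point (x, y) through the 2x3 affine matrix."""
    x, y = point
    new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return (new_x, new_y)


# Assumed example values: rotate by 90 degrees, double the scale, no translation.
m = make_affine(90, 2.0, 0.0, 0.0)
moved = apply_affine(m, (1.0, 0.0))
print(moved)  # approximately (0.0, 2.0), up to floating-point rounding
```

Deformable registration, also covered by the survey, replaces this single global matrix with a per-pixel displacement field, which the sketch does not attempt to model.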

arXiv:2305.18988 [pdf, other]

A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation

Authors: Omar Seddati, Nathan Hubens, Stéphane Dupont, Thierry Dutoit

Abstract: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing embedding through different approaches such as triplet loss, quadruplet loss, adding data augmentation, and using edge extraction. In this work, we tackle the problem from various angles. We start by examining the training data quality and show some of its limitations. Then, we introduce a Relative Triplet Loss (RTL), an adapted triplet loss to overcome those limitations through loss weighting based on anchor similarity. Through a series of experiments, we demonstrate that replacing a triplet loss with RTL outperforms previous state-of-the-art without the need for any data augmentation. In addition, we demonstrate why batch normalization is more suited for SBIR embeddings than l2-normalization and show that it significantly improves the performance of our models. We further investigate the capacity of models required for the photo and sketch domains and demonstrate that the photo encoder requires a higher capacity than the sketch encoder, which validates the hypothesis formulated in [34]. Then, we propose a straightforward approach to train small models, such as ShuffleNetv2 [22], efficiently with a marginal loss of accuracy through knowledge distillation. The same approach used with larger models enabled us to outperform previous state-of-the-art results and achieve a recall of 62.38% at k = 1 on The Sketchy Database [30].

Submitted 30 May, 2023; originally announced May 2023.
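The "recall at k = 1" figure quoted in the abstract above is a retrieval metric. A minimal sketch of how such a number is commonly computed follows, assuming each sketch query has exactly one correct photo. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the correct photo appears among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(results, k):
    """Average recall@k over queries.

    `results` is a list of (ranked_ids, relevant_id) pairs, one per sketch query.
    """
    if not results:
        raise ValueError("results must contain at least one query")
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in results) / len(results)


# Toy data (assumed): three queries, each with a ranked list of photo ids.
toy_results = [
    (["p1", "p7", "p3"], "p1"),  # correct photo ranked first
    (["p4", "p2", "p9"], "p2"),  # correct photo ranked second
    (["p5", "p6", "p8"], "p0"),  # correct photo not retrieved
]

print(mean_recall_at_k(toy_results, 1))
print(mean_recall_at_k(toy_results, 2))
```

With k = 1 only the first query counts as a hit, so the score is one third; raising k to 2 adds the second query.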

arXiv:2209.06629 [pdf, other]

Transformers and CNNs both Beat Humans on SBIR

Authors: Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit

Abstract: Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flip (even after model finetuning) is harming performance. To overcome this limitation, we propose several approaches and evaluate each of them in depth to check their effectiveness. Our main contributions are twofold: We propose and evaluate several intuitive modifications to build SBIR solutions with better flip equivariance. We show that vision transformers are more suited for the SBIR task, and that they outperform CNNs by a large margin. We carried out numerous experiments and introduce the first models to outperform human performance on a large-scale SBIR benchmark (Sketchy). Our best model achieves a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared to 46.2% for previous state-of-the-art methods.

Submitted 14 September, 2022; originally announced September 2022.

ACM Class: I.2.10

arXiv:2205.10266 [pdf, other]

Analysis of Co-Laughter Gesture Relationship on RGB videos in Dyadic Conversation Context

Authors: Hugo Bohy, Ahmad Hammoudeh, Antoine Maiorca, Stéphane Dupont, Thierry Dutoit

Abstract: The development of virtual agents has enabled human-avatar interactions to become increasingly rich and varied. Moreover, an expressive virtual agent, i.e. one that mimics the natural expression of emotions, enhances social interaction between a user (human) and an agent (intelligent machine). The set of non-verbal behaviors of a virtual character is, therefore, an important component in the context of human-machine interaction. Laughter is not just an audio signal, but an intrinsic relationship of multimodal non-verbal communication: in addition to audio, it includes facial expressions and body movements. Motion analysis often relies on a relevant motion capture dataset, but the main issue is that the acquisition of such a dataset is expensive and time-consuming. This work studies the relationship between laughter and body movements in dyadic conversations. The body movements were extracted from videos using a deep-learning-based pose estimator model. We found that, in the explored NDC-ME dataset, a single statistical feature (i.e., the maximum value, or the maximum of Fourier transform) of a joint movement weakly correlates with laughter intensity by 30%. However, we did not find a direct correlation between audio features and body movements. We discuss the challenges of using such a dataset for the audio-driven co-laughter motion synthesis task.

Submitted 20 May, 2022; originally announced May 2022.

Comments: 5 pages, 2 figures, 2 tables

arXiv:2202.05728 [pdf, other]

doi 10.1016/j.procs.2022.10.125

Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation

Authors: Ahmad Hammoudeh, Bastien Vanderplaetse, Stéphane Dupont

Abstract: This work aims at generating captions for soccer videos using deep learning. In this context, this paper introduces a dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper suggests evaluating generated captions at three levels: syntax (the commonly used evaluation metrics such as BLEU-score and CIDEr), meaning (the quality of descriptions for a domain expert), and corpus (the diversity of generated captions). The paper shows that the diversity of generated captions has improved (from 0.07 to 0.18) with semantics-related losses that prioritize selected words. Semantics-related losses and the utilization of more visual features (optical flow, inpainting) improved the normalized captioning score by 28%. The web page of this work: https://sites.google.com/view/soccercaptioning

Submitted 30 November, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

arXiv:2106.06736 [pdf, other]

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Authors: Mathilde Brousmiche, Jean Rouat, Stéphane Dupont

Abstract: Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. Inspired by prior studies in neuroscience, we couple both modalities at different levels of visual and audio paths. Furthermore, the network dynamically highlights a modality at a given time window relevant for classifying events. Experimental results on the AVE (Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach can effectively improve the accuracy in audio-visual event classification. Code is available at: https://github.com/numediart/MAFnet

Submitted 12 June, 2021; originally announced June 2021.

Comments: Preprint submitted to the Information Fusion journal in August 2020

arXiv:2011.06502 [pdf]

doi 10.1007/978-3-030-69367-1_5

Quality4.0 -- Transparent product quality supervision in the age of Industry 4.0

Authors: Jens Brandenburger, Christoph Schirm, Josef Melcher, Edgar Hancke, Marco Vannucci, Valentina Colla, Silvia Cateni, Rami Sellami, Sébastien Dupont, Annick Majchrowski, Asier Arteaga

Abstract: Progressive digitalization is changing the game of many industrial sectors. Focusing on product quality, the main profitability driver of this so-called Industry 4.0 will be the horizontal integration of information over the complete supply chain. Therefore, the European RFCS project 'Quality4.0' aims at developing an adaptive platform, which releases decisions on product quality and provides tailored information of high reliability that can be individually exchanged with customers. In this context, Machine Learning will be used to detect outliers in the quality data. This paper discusses the intermediate project results and the concepts developed so far for this horizontal integration of quality information.

Submitted 12 November, 2020; originally announced November 2020.

arXiv:2011.04258 [pdf, other]

Improved Soccer Action Spotting using both Audio and Video Streams

Authors: Bastien Vanderplaetse, Stéphane Dupont

Abstract: In this paper, we propose a study on multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks that consist in finding the temporal anchors of events in a video and determining which event they are. This is an important application of general activity understanding. Here, we propose an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate the audio stream into video-only-based architectures. We observed an average absolute improvement of the mean Average Precision (mAP) metric of 7.43% for the action classification task and of 4.19% for the action spotting task.

Submitted 9 November, 2020; originally announced November 2020.

Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 896-897

arXiv:2011.01018 [pdf, other]

AVECL-UMONS database for audio-visual event classification and localization

Authors: Mathilde Brousmiche, Stéphane Dupont, Jean Rouat

Abstract: We introduce the AVECL-UMons dataset for audio-visual event classification and localization in the context of office environments. The audio-visual dataset is composed of 11 event classes recorded at several realistic positions in two different rooms. Two types of sequences are recorded according to the number of events in the sequence. The dataset comprises 2662 unilabel sequences and 2724 multilabel sequences corresponding to a total of 5.24 hours. The dataset is publicly accessible online: https://zenodo.org/record/3965492#.X09wsobgrCI.

Submitted 2 October, 2020; originally announced November 2020.

arXiv:2010.02057 [pdf, other]

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

Authors: Jean-Benoit Delbrouck, Noé Tits, Stéphane Dupont

Abstract: This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state-of-the-art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated and the code is fully open for future research.

Submitted 5 October, 2020; originally announced October 2020.

Comments: EMNLP 2020 workshop: NLP Beyond Text (NLPBT)

arXiv:2006.15955 [pdf, other]

doi 10.18653/v1/2020.challengehml-1.1

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

Authors: Jean-Benoit Delbrouck, Noé Tits, Mathilde Brousmiche, Stéphane Dupont

Abstract: Understanding expressed sentiment and emotions are two crucial factors in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities. The proposed solution has also been submitted to the ACL20: Second Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI dataset. The code to replicate the presented experiments is open-source: https://github.com/jbdel/MOSEI_UMONS.

Submitted 29 June, 2020; originally announced June 2020.

Comments: Winner of the ACL20: Second Grand-Challenge on Multimodal Language

arXiv:1910.14609 [pdf, other]

Can adversarial training learn image captioning?

Authors: Jean-Benoit Delbrouck, Bastien Vanderplaetse, Stéphane Dupont

Abstract: Recently, generative adversarial networks (GAN) have gathered a lot of interest. Their efficiency in generating unseen samples of high quality, especially images, has improved over the years. In the field of Natural Language Generation (NLG), the use of the adversarial setting to generate meaningful sentences has proven difficult for two reasons: the lack of existing architectures to produce realistic sentences and the lack of evaluation tools. In this paper, we propose an adversarial architecture related to the conditional GAN (cGAN) that generates sentences according to a given image (also called image captioning). This attempt is the first that uses no pre-training or reinforcement methods. We also explain why our experiment settings can be safely evaluated and interpreted for further works.

Submitted 31 October, 2019; originally announced October 2019.

Comments: Accepted to NeurIPS 2019 ViGiL workshop

arXiv:1910.03343 [pdf, ps, other]

Modulated Self-attention Convolutional Network for VQA

Authors: Jean-Benoit Delbrouck, Antoine Maiorca, Nathan Hubens, Stéphane Dupont

Abstract: As new data-sets for real-world visual reasoning and compositional question answering are emerging, it might be necessary to use visual feature extraction as an end-to-end process during training. This small contribution aims to suggest new ideas to improve the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to modulate a CNN augmented with self-attention by a linguistic input. We show encouraging relative improvements for future research in this direction.

Submitted 31 October, 2019; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: Accepted at NeurIPS 2019 workshop: ViGIL

arXiv:1910.02766 [pdf, other]

Adversarial reconstruction for Multi-modal Machine Translation

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: Even with the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e. identifying) the components of a structured description in an image still remains a challenging task. This contribution aims to propose a model which learns grounding by reconstructing the visual features for the Multi-modal translation task. Previous works have partially investigated standard approaches such as regression methods to approximate the reconstruction of a visual input. In this paper, we propose a different and novel approach which learns grounding by adversarial feedback. To do so, we modulate our network following the recent promising adversarial architectures and evaluate how the adversarial response from a visual reconstruction as an auxiliary task helps the model in its learning. We report the highest scores in terms of BLEU and METEOR metrics on the different datasets.

Submitted 7 October, 2019; originally announced October 2019.

arXiv:1811.09178 [pdf, other]

Object-oriented Targets for Visual Navigation using Rich Semantic Representations

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: When searching for an object, humans navigate through a scene using semantic information and spatial relationships. We look for an object using our knowledge of its attributes and relationships with other objects to infer the probable location. In this paper, we propose to tackle the visual navigation problem using rich semantic representations of the observed scene and object-oriented targets to train an agent. We show that both allow the agent to generalize to new targets and unseen scenes in a short amount of training time.

Submitted 17 December, 2018; v1 submitted 22 November, 2018; originally announced November 2018.

Comments: Presented at NIPS workshop (ViGIL)

arXiv:1810.06245 [pdf, other]

Bringing back simplicity and lightliness into neural image captioning

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: Neural Image Captioning (NIC) or neural caption generation has attracted a lot of attention over the last few years. Describing an image with a natural language has been an emerging challenge in both fields of computer vision and language processing. Therefore a lot of research has focused on driving this task forward with new creative ideas. So far, the goal has been to maximize scores on automated metrics and to do so, one has to come up with a plurality of new modules and techniques. Once these add up, the models become complex and resource-hungry. In this paper, we take a small step backwards in order to study an architecture with an interesting trade-off between performance and computational complexity. To do so, we tackle every component of a neural captioning model and propose one or more solutions that lighten the model overall. Our ideas are inspired by two related tasks: Multimodal and Monomodal Neural Machine Translation.

Submitted 15 October, 2018; originally announced October 2018.

arXiv:1810.06233 [pdf, ps, other]

UMONS Submission for WMT18 Multimodal Translation Task

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the third conference on machine translation (WMT18). We explore a novel architecture, called deepGRU, based on recent findings in the related task of Neural Image Captioning (NIC). The models presented in the following sections lead to the best METEOR translation score for both constrained (English, image) -> German and (English, image) -> French sub-tasks.

Submitted 15 October, 2018; originally announced October 2018.

arXiv:1805.06349 [pdf]

Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks

Authors: Charley Gros, Benjamin De Leener, Atef Badji, Josefina Maranzano, Dominique Eden, Sara M. Dupont, Jason Talbott, Ren Zhuoquiong, Yaou Liu, Tobias Granberg, Russell Ouellette, Yasuhiko Tachibana, Masaaki Hori, Kouhei Kamiya, Lydia Chougar, Leszek Stawiarz, Jan Hillert, Elise Bannier, Anne Kerbrat, Gilles Edan, Pierre Labauge, Virginie Callot, Jean Pelletier, Bertrand Audoin, Henitsoa Rasoanandrianina, et al. (27 additional authors not shown)

Abstract: The spinal cord is frequently affected by atrophy and/or lesions in multiple sclerosis (MS) patients. Segmentation of the spinal cord and lesions from MRI data provides measures of damage, which are key criteria for the diagnosis, prognosis, and longitudinal monitoring in MS. Automating this operation eliminates inter-rater variability and increases the efficiency of large-throughput analysis pipelines. Robust and reliable segmentation across multi-site spinal cord data is challenging because of the large variability related to acquisition parameters and image artifacts. The goal of this study was to develop a fully-automatic framework, robust to variability in both image parameters and clinical condition, for segmentation of the spinal cord and intramedullary MS lesions from conventional MRI data. Scans of 1,042 subjects (459 healthy controls, 471 MS patients, and 112 with other spinal pathologies) were included in this multi-site study (n=30). Data spanned three contrasts (T1-, T2-, and T2*-weighted) for a total of 1,943 volumes. The proposed cord and lesion automatic segmentation approach is based on a sequence of two Convolutional Neural Networks (CNNs). To deal with the very small proportion of spinal cord and/or lesion voxels compared to the rest of the volume, a first CNN with 2D dilated convolutions detects the spinal cord centerline, followed by a second CNN with 3D convolutions that segments the spinal cord and/or lesions. When compared against manual segmentation, our CNN-based approach showed a median Dice of 95% vs. 88% for PropSeg, a state-of-the-art spinal cord segmentation method. Regarding lesion segmentation on MS data, our framework provided a Dice of 60%, a relative volume difference of -15%, and a lesion-wise detection sensitivity and precision of 83% and 77%, respectively. The proposed framework is open-source and readily available in the Spinal Cord Toolbox.

Submitted 11 September, 2018; v1 submitted 16 May, 2018; originally announced May 2018.

Comments: 38 pages, 7 figures, 2 tables
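The Dice scores reported in the abstract above measure the overlap between an automatic segmentation and a manual one. A minimal sketch of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), follows. The function name, the representation of a mask as a set of voxel indices, and the toy masks are assumptions made for illustration and do not come from the paper or the Spinal Cord Toolbox.

```python
def dice_coefficient(predicted_voxels, reference_voxels):
    """Dice overlap between two segmentations given as collections of voxel indices.

    Returns a value in [0, 1]; 1.0 means the two masks are identical.
    """
    predicted = set(predicted_voxels)
    reference = set(reference_voxels)
    if not predicted and not reference:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


# Toy masks (assumed): voxels are (x, y, z) index tuples.
automatic = {(0, 0, 0), (0, 1, 0), (1, 1, 0)}
manual = {(0, 1, 0), (1, 1, 0), (2, 1, 0)}

print(dice_coefficient(automatic, manual))
```

In the toy case two of the three voxels in each mask coincide, giving 2 * 2 / (3 + 3), or about 0.67.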

arXiv:1801.06349 [pdf]

Proceedings of eNTERFACE 2015 Workshop on Intelligent Interfaces

Authors: Matei Mancas, Christian Frisson, Joëlle Tilmanne, Nicolas d'Alessandro, Petr Barborka, Furkan Bayansar, Francisco Bernard, Rebecca Fiebrink, Alexis Heloir, Edgar Hemery, Sohaib Laraba, Alexis Moinet, Fabrizio Nunnari, Thierry Ravet, Loïc Reboursière, Alvaro Sarasua, Mickaël Tits, Noé Tits, François Zajéga, Paolo Alborno, Ksenia Kolykhalova, Emma Frid, Damiano Malafronte, Lisanne Huis in't Veld, Hüseyin Cakmak, et al. (49 additional authors not shown)

Abstract: The 11th Summer Workshop on Multimodal Interfaces eNTERFACE 2015 was hosted by the Numediart Institute of Creative Technologies of the University of Mons from August 10th to September 2015. During the four weeks, students and researchers from all over the world came together in the Numediart Institute of the University of Mons to work on eight selected projects structured around intelligent interfaces. Eight projects were selected and their reports are shown here.

Submitted 19 January, 2018; originally announced January 2018.

Comments: 159 pages

arXiv:1712.03449 [pdf, other]

Modulating and attending the source image during encoding improves Multimodal Translation

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: We propose a new and fully end-to-end approach for multimodal translation where the source text encoder modulates the entire visual input processing using conditional batch normalization, in order to compute the most informative image features for our task. Additionally, we propose a new attention mechanism derived from this original idea, where the attention model for the visual input is conditioned on the source text encoder representations. In the paper, we detail our models as well as the image analysis pipeline. Finally, we report experimental results. They are, as far as we know, the new state of the art on three different test sets.

Submitted 9 December, 2017; originally announced December 2017.

Comments: Accepted at NIPS Workshop

Journal ref: Visually-Grounded Interaction and Language, NIPS 2017 Workshop

arXiv:1707.01009 [pdf, other]

doi 10.21437/GLU.2017-13

Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation

Authors: Jean-Benoit Delbrouck, Stéphane Dupont, Omar Seddati

Abstract: In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is considered as the multimodal image caption translation task. The images are processed with a Convolutional Neural Network (CNN) to extract visual features exploitable by the translation model. So far, the CNNs used are pre-trained on object detection and localization tasks. We hypothesize that richer architectures, such as dense captioning models, may be more suitable for MNMT and could lead to improved translations. We extend this intuition to the word-embeddings, where we compute both linguistic and visual representation for our corpus vocabulary. We combine and compare different confi

Submitted 16 December, 2017; v1 submitted 4 July, 2017; originally announced July 2017.

Comments: Accepted to GLU 2017. arXiv admin note: text overlap with arXiv:1707.00995

Journal ref: Proc. GLU 2017 International Workshop on Grounding Language Understanding

arXiv:1707.00995 [pdf, other]

doi 10.18653/v1/D17-1095

An empirical study on the effectiveness of images in Multimodal Neural Machine Translation

Authors: Jean-Benoit Delbrouck, Stéphane Dupont

Abstract: In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus both on sentence parts and image regions that they describe. In this paper, we compare several attention mechanisms on the multimodal translation task (English, image to German) and evaluate the ability of the model to make use of images to improve translation. Although we surpass state-of-the-art scores on the Multi30k data set, we nevertheless identify and report different misbehaviors of the machine while translating.

Submitted 4 July, 2017; originally announced July 2017.

Comments: Accepted to EMNLP 2017

Journal ref: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

arXiv:1703.08084 [pdf, other]

Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation

Authors: Jean-Benoit Delbrouck, Stephane Dupont

Abstract: In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus both on sentence parts and image regions. Approaches to pool two modalities usually include element-wise product, sum or concatenation. In this paper, we evaluate the more advanced Multimodal Compact Bilinear pooling method, which takes the outer product of two vectors to combine the attention features for the two modalities. This has been previously investigated for visual question answering. We try out this approach for multimodal image caption translation and show improvements compared to basic combination methods.

Submitted 23 March, 2017; originally announced March 2017.

Comments: Submitted to ICLR Workshop 2017

Showing 1–23 of 23 results for author: Dupont, S