
How far can we go with ImageNet for Text-to-Image generation?

Authors: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton

Abstract: Recent text-to-image generation models have achieved remarkable results by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs open source) and reproducibility (data decay vs established collections). We challenge this established paradigm by demonstrating that one can match or outperform models trained on massive web-scraped collections, using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +1% overall score over SD-XL on GenEval and +0.5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. This opens the way for more reproducible research as ImageNet is a widely available dataset and our standardized training setup does not require massive compute resources.

Submitted 21 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

arXiv:2412.06781

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Authors: Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu

Abstract: Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Code and models will be made available.

Submitted 9 December, 2024; originally announced December 2024.

Comments: Project page: https://nicolas-dufour.github.io/plonk

arXiv:2411.12663

PoM: Efficient Image and Video Generation with the Polynomial Mixer

Authors: David Picard, Nicolas Dufour

Abstract: Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements in terms of both memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirements, while still being able to train in parallel. We show that the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high-quality samples while using fewer computational resources. The code is available at https://github.com/davidpicard/HoMM.

Submitted 19 November, 2024; originally announced November 2024.

arXiv:2407.10220

PAFUSE: Part-based Diffusion for 3D Whole-Body Pose Estimation

Authors: Nermin Samet, Cédric Rommel, David Picard, Eduardo Valle

Abstract: We introduce a novel approach for 3D whole-body pose estimation, addressing the challenge of scale and deformability variance across body parts that arises when extending the 17 major joints of the human body to fine-grained keypoints on the face and hands. In addition to addressing the challenge of exploiting motion in unevenly sampled data, we combine stable diffusion with a hierarchical part representation that predicts the relative locations of fine-grained keypoints within each part (e.g., face) with respect to the part's local reference frame. On the H3WB dataset, our method greatly outperforms the current state of the art, which fails to exploit the temporal information. We also show considerable improvements compared to other spatiotemporal 3D human-pose estimation approaches that fail to account for the body part specificities. Code is available at https://github.com/valeoai/PAFUSE.

Submitted 3 January, 2025; v1 submitted 14 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024 Workshop T-CAP (Towards a Complete Analysis of People)

arXiv:2405.20324

Don't drop your samples! Coherence-aware training benefits Conditional diffusion

Authors: Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard

Abstract: Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.

Submitted 18 February, 2025; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted at CVPR 2024 as a Highlight. Project page: https://nicolas-dufour.github.io/cad.html

arXiv:2404.13040

Analysis of Classifier-Free Guidance Weight Schedulers

Authors: Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton

Abstract: Classifier-Free Guidance (CFG) enhances the quality and condition adherence of text-to-image diffusion models. It operates by combining the conditional and unconditional predictions using a fixed weight. However, recent works vary the weights throughout the diffusion process, reporting superior results but without providing any rationale or analysis. By conducting comprehensive experiments, this paper provides insights into CFG weight schedulers. Our findings suggest that simple, monotonically increasing weight schedulers consistently lead to improved performance, requiring merely a single line of code. In addition, more complex parametrized schedulers can be optimized for further improvement, but do not generalize across different models and tasks.

Submitted 4 December, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

Journal ref: Transactions on Machine Learning Research, 2835-8856, 2024
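
The weight scheduling idea summarized in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the choice of a linear ramp, and the convention that the guided prediction is `uncond + w * (cond - uncond)`. It is not taken from the paper's code, and the paper's exact schedule may differ.

```python
def linear_cfg_weight(step, num_steps, w_max):
    """Hypothetical schedule: weight rises linearly from 0 at the first
    denoising step (step = 0) to w_max at the last (step = num_steps - 1)."""
    if num_steps < 2:
        return w_max
    return w_max * step / (num_steps - 1)


def guided_prediction(cond, uncond, weight):
    """Combine conditional and unconditional predictions element-wise.

    Assumed convention: weight = 0 gives the unconditional prediction,
    weight = 1 gives the conditional one, weight > 1 extrapolates past it.
    """
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    num_steps = 5
    for step in range(num_steps):
        w = linear_cfg_weight(step, num_steps, w_max=8.0)
        print(step, w, guided_prediction([1.0, 0.5], [0.0, 0.5], w))
```

A fixed-weight baseline corresponds to ignoring `step` and always returning `w_max`; the schedule above only changes how the weight is chosen at each step, not how the two predictions are combined.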

arXiv:2401.09629

Multiple Locally Linear Kernel Machines

Authors: David Picard

Abstract: In this paper we propose a new non-linear classifier based on a combination of locally linear classifiers. A well-known optimization formulation is given as we cast the problem as an $\ell_1$ Multiple Kernel Learning (MKL) problem using many locally linear kernels. Since the number of such kernels is huge, we provide a scalable generic MKL training algorithm handling streaming kernels. With respect to the inference time, the resulting classifier fills the gap between high accuracy but slow non-linear classifiers (such as classical MKL) and fast but low accuracy linear classifiers.

Submitted 17 January, 2024; originally announced January 2024.

Comments: This paper was written in 2014 and was originally submitted but rejected at ICML'15

arXiv:2310.11265

Image Compression using only Attention based Neural Networks

Authors: Natacha Luka, Romain Negrel, David Picard

Abstract: In recent research, Learned Image Compression has gained prominence for its capacity to outperform traditional handcrafted pipelines, especially at low bit-rates. While existing methods incorporate convolutional priors with occasional attention blocks to address long-range dependencies, recent advances in computer vision advocate for a transformative shift towards fully transformer-based architectures grounded in the attention mechanism. This paper investigates the feasibility of image compression exclusively using attention layers within our novel model, QPressFormer. We introduce the concept of learned image queries to aggregate patch information via cross-attention, followed by quantization and coding techniques. Through extensive evaluations, our work demonstrates competitive performance achieved by convolution-free architectures across the popular Kodak, DIV2K, and CLIC datasets.

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2308.11677

An Analysis of Initial Training Strategies for Exemplar-Free Class-Incremental Learning

Authors: Grégoire Petit, Michael Soumm, Eva Feillet, Adrian Popescu, Bertrand Delezoide, David Picard, Céline Hudelot

Abstract: Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum. The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning. We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting. Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.

Submitted 27 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

arXiv:2306.02928

LRVS-Fashion: Extending Visual Search with Referring Instructions

Authors: Simon Lepage, Jérémie Mary, David Picard

Abstract: This paper introduces a new challenge for image similarity search in the context of fashion, addressing the inherent ambiguity in this domain stemming from complex images. We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity, following recent interest in the industry. We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs, designed explicitly for this task. However, unlike traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and demonstrates robustness, reaching Recall at one superior to strong detection-based baselines against 2M distractors. The dataset is available at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion.

Submitted 15 May, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: 29 pages, 14 figures, 5 tables

MSC Class: 68T07 (Primary) 68T45 (Secondary)

ACM Class: I.2.10

arXiv:2302.00384

Alphazzle: Jigsaw Puzzle Solver with Deep Monte-Carlo Tree Search

Authors: Marie-Morgane Paumard, Hedi Tabia, David Picard

Abstract: Solving jigsaw puzzles requires grasping the visual features of a sequence of patches and efficiently exploring a solution space that grows exponentially with the sequence length. Therefore, visual deep reinforcement learning (DRL) should answer this problem more efficiently than optimization solvers coupled with neural networks. Based on this assumption, we introduce Alphazzle, a reassembly algorithm based on single-player Monte Carlo Tree Search (MCTS). A major difference with DRL algorithms lies in the unavailability of game reward for MCTS, and we show how to estimate it from the visual input with neural networks. This constraint is induced by the puzzle-solving task and dramatically adds to the task complexity (and interest!). We perform an in-depth ablation study that shows the importance of MCTS and the neural networks working together. We achieve excellent results and get exciting insights into the combination of DRL and visual feature learning.

Submitted 1 February, 2023; originally announced February 2023.

arXiv:2212.10292

Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?

Authors: Monika Wysoczańska, Tom Monnier, Tomasz Trzciński, David Picard

Abstract: Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about the objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short for solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. In our proposed framework we show how to methodologically approach this evaluation.

Submitted 20 December, 2022; originally announced December 2022.

arXiv:2211.15692

H3WB: Human3.6M 3D WholeBody Dataset and Benchmark

Authors: Yue Zhu, Nermin Samet, David Picard

Abstract: We present a benchmark for 3D human whole-body pose estimation, which involves identifying accurate 3D keypoints on the entire human body, including face, hands, body, and feet. Currently, the lack of a fully annotated and accurate 3D whole-body dataset results in deep networks either being trained separately on specific body parts, which are combined during inference, or relying on pseudo-ground truth provided by parametric body models, which are not as accurate as detection-based methods. To overcome these issues, we introduce the Human3.6M 3D WholeBody (H3WB) dataset, which provides whole-body annotations for the Human3.6M dataset using the COCO Wholebody layout. H3WB comprises 133 whole-body keypoint annotations on 100K images, made possible by our new multi-view pipeline. We also propose three tasks: i) 3D whole-body pose lifting from 2D complete whole-body pose, ii) 3D whole-body pose lifting from 2D incomplete whole-body pose, and iii) 3D whole-body pose estimation from a single RGB image. Additionally, we report several baselines from popular methods for these tasks. Furthermore, we also provide automated 3D whole-body annotations of TotalCapture and experimentally show that when used with H3WB it helps to improve the performance. Code and dataset are available at https://github.com/wholebody3d/wholebody3d

Submitted 6 September, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: Accepted by ICCV 2023

arXiv:2211.13131

FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning

Authors: Grégoire Petit, Adrian Popescu, Hugo Schindler, David Picard, Bertrand Delezoide

Abstract: Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine-tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases.

Submitted 28 November, 2023; v1 submitted 23 November, 2022; originally announced November 2022.
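
The geometric translation described in the FeTrIL abstract above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the translation shifts each new-class feature by the difference between the stored past-class centroid and the new-class centroid, and the function names `centroid` and `translate_features` are hypothetical.

```python
def centroid(features):
    """Mean feature vector of a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def translate_features(new_features, past_centroid):
    """Create pseudo-features for a past class from new-class features.

    Assumed form of the translation: each new-class feature is shifted by
    (past_centroid - new_centroid), so the pseudo-features are centered on
    the stored centroid of the past class while keeping the spread of the
    new-class features.
    """
    new_centroid = centroid(new_features)
    return [
        [f + p - n for f, n, p in zip(feat, new_centroid, past_centroid)]
        for feat in new_features
    ]


if __name__ == "__main__":
    new_class = [[1.0, 2.0], [3.0, 4.0]]   # features of a new class
    stored = [10.0, 0.0]                   # centroid kept for a past class
    pseudo = translate_features(new_class, stored)
    print(pseudo)
    print(centroid(pseudo))
```

Under this assumption only one centroid per past class needs to be stored, which matches the storage requirement stated in the abstract; the pseudo-features and the actual new-class features would then be passed to a linear classifier.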

arXiv:2210.04883

SCAM! Transferring humans between images with Semantic Cross Attention Modulation

Authors: Nicolas Dufour, David Picard, Vicky Kalogeiton

Abstract: A large body of recent work targets semantically conditioned image generation. Most such methods focus on the narrower task of pose transfer and ignore the more challenging task of subject transfer that consists in not only transferring the pose but also the appearance and background. In this work, we introduce SCAM (Semantic Cross Attention Modulation), a system that encodes rich and diverse information in each semantic region of the image (including foreground and background), thus achieving precise generation with emphasis on fine details. This is enabled by the Semantic Attention Transformer Encoder that extracts multiple latent vectors for each semantic region, and the corresponding generator that exploits these multiple latents by using semantic cross attention modulation. It is trained only using a reconstruction setup, while subject transfer is performed at test time. Our analysis shows that our proposed architecture is successful at encoding the diversity of appearance in each semantic region. Extensive experiments on the iDesigner and CelebAMask-HD datasets show that SCAM outperforms SEAN and SPADE; moreover, it sets the new state of the art on subject transfer.

Submitted 10 October, 2022; originally announced October 2022.

Comments: Accepted at ECCV 2022

arXiv:2210.02231

Decanus to Legatus: Synthetic training for 2D-3D human pose lifting

Authors: Yue Zhu, David Picard

Abstract: 3D human pose estimation is a challenging task because of the difficulty of acquiring ground-truth data outside of controlled environments. A number of further issues have been hindering progress in building a universal and robust model for this task, including domain gaps between different datasets, unseen actions between train and test datasets, various hardware settings and the high cost of annotation. In this paper, we propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus) during the training of a 2D to 3D human pose lifter neural network. Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the generalization potential of our framework.

Submitted 5 October, 2022; originally announced October 2022.

Comments: Accepted by ACCV 2022

arXiv:2209.06606

PlaStIL: Plastic and Stable Memory-Free Class-Incremental Learning

Authors: Grégoire Petit, Adrian Popescu, Eden Belouadah, David Picard, Bertrand Delezoide

Abstract: Plasticity and stability are needed in class-incremental learning in order to learn from new data while preserving past knowledge. Due to catastrophic forgetting, finding a compromise between these two properties is particularly challenging when no memory buffer is available. Mainstream methods need to store two deep models since they integrate new classes using fine-tuning with knowledge distillation from the previous incremental state. We propose a method which has a similar number of parameters but distributes them differently in order to find a better balance between plasticity and stability. Following an approach already deployed by transfer-based incremental methods, we freeze the feature extractor after the initial state. Classes in the oldest incremental states are trained with this frozen extractor to ensure stability. Recent classes are predicted using partially fine-tuned models in order to introduce plasticity. Our proposed plasticity layer can be incorporated into any transfer-based method designed for exemplar-free incremental learning, and we apply it to two such methods. Evaluation is done with three large-scale datasets. Results show that performance gains are obtained in all tested configurations compared to existing methods.

Submitted 4 July, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

arXiv:2207.10541

Unveiling the Latent Space Geometry of Push-Forward Generative Models

Authors: Thibaut Issenhuth, Ugo Tanielian, Jérémie Mary, David Picard

Abstract: Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.

Submitted 15 May, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

arXiv:2207.08782 [pdf, other]

Instance-Aware Observer Network for Out-of-Distribution Object Segmentation

Authors: Victor Besnier, Andrei Bursuc, David Picard, Alexandre Briot

Abstract: Recent works on predictive uncertainty estimation have shown promising results on Out-Of-Distribution (OOD) detection for semantic segmentation. However, these methods struggle to precisely locate the point of interest in the image, i.e., the anomaly. This limitation is due to the difficulty of fine-grained prediction at the pixel level. To address this issue, we build upon the recent ObsNet approach by providing object instance knowledge to the observer. We extend ObsNet by harnessing an instance-wise mask prediction. We use an additional, class-agnostic object detector to filter and aggregate observer predictions. Finally, we predict a unique anomaly score for each instance in the image. We show that our proposed method accurately disentangles in-distribution objects from OOD objects on three datasets.

Submitted 29 August, 2022; v1 submitted 18 July, 2022; originally announced July 2022.

arXiv:2111.15264 [pdf, other]

EdiBERT, a generative model for image editing

Authors: Thibaut Issenhuth, Ugo Tanielian, Jérémie Mary, David Picard

Abstract: Advances in computer vision are pushing the limits of image manipulation, with generative models sampling detailed images on various tasks. However, a specialized model is often developed and trained for each specific task, even though many image editing tasks share similarities. In denoising, inpainting, or image compositing, one always aims at generating a realistic image from a low-quality one. In this paper, we aim at making a step towards a unified approach for image editing. To do so, we propose EdiBERT, a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder. We argue that such a bidirectional model is suited for image manipulation since any patch can be re-sampled conditionally to the whole image. Using this unique and straightforward training objective, we show that the resulting model matches state-of-the-art performance on a wide variety of tasks: image denoising, image completion, and image composition.

Submitted 21 July, 2022; v1 submitted 30 November, 2021; originally announced November 2021.

arXiv:2111.10248 [pdf, other]

Non asymptotic bounds in asynchronous sum-weight gossip protocols

Authors: David Picard, Jérôme Fellus, Stéphane Garnier

Abstract: This paper focuses on non-asymptotic diffusion time in asynchronous gossip protocols. Asynchronous gossip protocols are designed to perform distributed computation in a network of nodes by randomly exchanging messages on the associated graph. To achieve consensus among nodes, a minimal number of messages has to be exchanged. We provide a probabilistic bound on this number for the general case. We provide an explicit formula for fully connected graphs depending only on the number of nodes, and an approximation for any graph depending on the spectrum of the graph.

Submitted 19 November, 2021; originally announced November 2021.

Comments: Unpublished work done circa 2016
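The sum-weight scheme named in the title can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the function name `sum_weight_gossip`, the fully connected 8-node graph, and the budget of 2000 messages are assumptions chosen for illustration. Each node holds a (sum, weight) pair; when a node wakes up, it keeps half of its pair and sends the other half to a random neighbor, and each node's estimate of the network average is its sum divided by its weight.

```python
import random


def sum_weight_gossip(values, neighbors, num_messages, seed=0):
    """Simulate asynchronous sum-weight gossip and return per-node estimates."""
    rng = random.Random(seed)
    sums = list(values)              # each node starts with its own value
    weights = [1.0] * len(values)    # and a unit weight
    for _ in range(num_messages):
        sender = rng.randrange(len(values))        # a random node wakes up
        receiver = rng.choice(neighbors[sender])   # and picks a random neighbor
        # The sender keeps half of its (sum, weight) pair ...
        sums[sender] *= 0.5
        weights[sender] *= 0.5
        # ... and the receiver adds the other half to its own pair.
        sums[receiver] += sums[sender]
        weights[receiver] += weights[sender]
    return [s / w for s, w in zip(sums, weights)]


# Toy setup (assumed): 8 nodes, fully connected, node i holds the value i.
n = 8
values = [float(i) for i in range(n)]
neighbors = [[j for j in range(n) if j != i] for i in range(n)]
estimates = sum_weight_gossip(values, neighbors, num_messages=2000)
print(max(abs(e - sum(values) / n) for e in estimates))
```

The total sum and total weight are conserved by every exchange, which is why the ratios converge to the true average; the number of messages needed for a given accuracy is the quantity the abstract bounds.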

arXiv:2110.09803 [pdf, other]

Latent reweighting, an almost free improvement for GANs

Authors: Thibaut Issenhuth, Ugo Tanielian, David Picard, Jeremie Mary

Abstract: Standard formulations of GANs, where a continuous function deforms a connected latent space, have been shown to be misspecified when fitting different classes of images. In particular, the generator will necessarily sample some low-quality images in between the classes. Rather than modifying the architecture, a line of work aims at improving the sampling quality from pre-trained generators at the expense of increased computational cost. Building on this, we introduce an additional network to predict latent importance weights and two associated sampling methods to avoid the poorest samples. This idea has several advantages: 1) it provides a way to inject disconnectedness into any GAN architecture, 2) since the rejection happens in the latent space, it avoids going through both the generator and the discriminator, saving computation time, 3) this importance weights formulation provides a principled way to reduce the Wasserstein distance to the target distribution. We demonstrate the effectiveness of our method on several datasets, both synthetic and high-dimensional.

Submitted 19 October, 2021; originally announced October 2021.
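Rejection in the latent space, as described in the abstract, might look roughly like the sketch below. Everything here is an assumption made for illustration: `importance_weight` is a hand-written stand-in for the learned weight network, the latent space is one-dimensional, and the upper bound `max_weight` is known in advance. The generator is never called, which is the source of the computational saving the abstract mentions.

```python
import random


def importance_weight(z):
    """Hypothetical stand-in for the learned weight network.

    It assigns low weight to latents near 0, mimicking a region that
    would map to poor samples in between two classes.
    """
    return min(1.0, abs(z))


def latent_rejection_sampling(num_samples, max_weight=1.0, seed=0):
    """Draw latents from N(0, 1) and keep each with probability w(z) / max_weight."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < num_samples:
        z = rng.gauss(0.0, 1.0)                      # proposal from the prior
        if rng.random() < importance_weight(z) / max_weight:
            accepted.append(z)                       # only these would reach the generator
    return accepted


latents = latent_rejection_sampling(2000)
near_zero = sum(1 for z in latents if abs(z) < 0.2) / len(latents)
print(f"fraction of accepted latents with |z| < 0.2: {near_zero:.3f}")
```

Under the standard normal prior about 16% of latents fall in |z| < 0.2; after rejection that region is strongly depleted, so the low-weight zone is effectively carved out of the latent distribution.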

arXiv:2109.08203 [pdf, other]

Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision

Authors: David Picard

Abstract: In this paper I investigate the effect of random seed selection on the accuracy when using popular deep learning architectures for computer vision. I scan a large number of seeds (up to $10^4$) on CIFAR 10 and I also scan fewer seeds on ImageNet using pre-trained models to investigate large scale datasets. The conclusions are that even if the variance is not very large, it is surprisingly easy to find an outlier that performs much better or much worse than the average.

Submitted 11 May, 2023; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: fixed typos

arXiv:2108.08109 [pdf, other]

Image Collation: Matching illustrations in manuscripts

Authors: Ryad Kaoua, Xi Shen, Alexandra Durr, Stavros Lazaris, David Picard, Mathieu Aubry

Abstract: Illustrations are an essential transmission instrument. For a historian, the first step in studying their evolution in a corpus of similar manuscripts is to identify which ones correspond to each other. This image collation task is daunting for manuscripts separated by many lost copies, spreading over centuries, which might have been completely re-organized and greatly modified to adapt to novel knowledge or belief and include hundreds of illustrations. Our contributions in this paper are threefold. First, we introduce the task of illustration collation and a large annotated public dataset to evaluate solutions, including 6 manuscripts of 2 different texts with more than 2,000 illustrations and 1,200 annotated correspondences. Second, we analyze state-of-the-art similarity measures for this task and show that they succeed in simple cases but struggle for large manuscripts when the illustrations have undergone very significant changes and are discriminated only by fine details. Finally, we show clear evidence that significant performance boosts can be expected by exploiting cycle-consistent correspondences. Our code and data are available on http://imagine.enpc.fr/~shenx/ImageCollation.

Submitted 18 August, 2021; originally announced August 2021.

Comments: accepted to ICDAR 2021

arXiv:2108.01634 [pdf, other]

Triggering Failures: Out-Of-Distribution detection by learning from local adversarial attacks in Semantic Segmentation

Authors: Victor Besnier, Andrei Bursuc, David Picard, Alexandre Briot

Abstract: In this paper, we tackle the detection of out-of-distribution (OOD) objects in semantic segmentation. By analyzing the literature, we found that current methods are either accurate or fast but not both, which limits their usability in real-world applications. To get the best of both aspects, we propose to mitigate the common shortcomings by following four design principles: decoupling the OOD detection from the segmentation task, observing the entire segmentation network instead of just its output, generating training data for the OOD detector by leveraging blind spots in the segmentation network, and focusing the generated data on localized regions in the image to simulate OOD objects. Our main contribution is a new OOD detection architecture called ObsNet associated with a dedicated training scheme based on Local Adversarial Attacks (LAA). We validate the soundness of our approach across numerous ablation studies. We also show it obtains top performance both in speed and accuracy when compared to ten recent methods from the literature on three different datasets.

Submitted 3 August, 2021; originally announced August 2021.

arXiv:2105.13688 [pdf, other]

Learning Uncertainty For Safety-Oriented Semantic Segmentation In Autonomous Driving

Authors: Victor Besnier, David Picard, Alexandre Briot

Abstract: In this paper, we show how uncertainty estimation can be leveraged to enable safety-critical image segmentation in autonomous driving, by triggering a fallback behavior if a target accuracy cannot be guaranteed. We introduce a new uncertainty measure based on disagreeing predictions as measured by a dissimilarity function. We propose to estimate this dissimilarity by training a deep neural architecture in parallel to the task-specific network. This allows the observer to be dedicated to the uncertainty estimation, and lets the task-specific network make predictions. We propose to use self-supervision to train the observer, which implies that our method does not require additional training data. We show experimentally that our proposed approach is much less computationally intensive at inference time than competing methods (e.g. MCDropout), while delivering better results on safety-oriented evaluation metrics on the CamVid dataset, especially in the case of glare artifacts.

Submitted 28 May, 2021; originally announced May 2021.

arXiv:2103.02306 [pdf, ps, other]

Rate Analysis and Deep Neural Network Detectors for SEFDM FTN Systems

Authors: Arsenia Chorti, David Picard

Abstract: In this work we compare the capacity and achievable rate of uncoded faster than Nyquist (FTN) signalling in the frequency domain, also referred to as spectrally efficient FDM (SEFDM). We propose a deep residual convolutional neural network detector for SEFDM signals in additive white Gaussian noise channels, which makes it possible to approach the Mazo limit in systems with up to 60 subcarriers. Notably, the deep detectors achieve a loss of less than 0.4-0.7 dB for uncoded QPSK SEFDM systems of 12 to 60 subcarriers at a 15% spectral compression.

Submitted 3 March, 2021; originally announced March 2021.

arXiv:2012.07487 [pdf, other]

Clustering high dimensional meteorological scenarios: results and performance index

Authors: Yamila Barrera, Leonardo Boechi, Matthieu Jonckheere, Vincent Lefieux, Dominique Picard, Ezequiel Smucler, Agustin Somacal, Alfredo Umfurer

Abstract: The Réseau de Transport d'Électricité (RTE) is the main operator of the French electricity network and dedicates considerable resources and effort to understanding climate time series data. We discuss here the problem and the methodology of grouping and selecting representatives of possible climate scenarios among a large number of climate simulations provided by RTE. The data used is composed of temperature time series for 200 different possible scenarios on a grid of geographical locations in France. These should be clustered in order to detect common patterns in temperature curves and help to choose representative scenarios for network simulations, which in turn can be used for energy optimisation. We first show that the choice of the distance used for the clustering has a strong impact on the meaning of the results: depending on the type of distance used, either spatial or temporal patterns prevail. Then we discuss the difficulty of fine-tuning the distance choice (combined with a dimension reduction procedure) and we propose a methodology based on a carefully designed index.

Submitted 14 December, 2020; originally announced December 2020.

Comments: 19 pages, 14 figures

arXiv:2009.01998 [pdf, other]

SSP-Net: Scalable Sequential Pyramid Networks for Real-Time 3D Human Pose Regression

Authors: Diogo Luvizon, Hedi Tabia, David Picard

Abstract: In this paper we propose a highly scalable convolutional neural network, end-to-end trainable, for real-time 3D human pose regression from still RGB images. We call this approach the Scalable Sequential Pyramid Networks (SSP-Net) as it is trained with refined supervision at multiple scales in a sequential manner. Our network requires a single training procedure and is capable of producing its best predictions at 120 frames per second (FPS), or acceptable predictions at more than 200 FPS when cut at test time. We show that the proposed regression approach is invariant to the size of feature maps, allowing our method to perform multi-resolution intermediate supervisions and reaching results comparable to the state-of-the-art with very low resolution feature maps. We demonstrate the accuracy and the effectiveness of our method by providing extensive experiments on two of the most important publicly available datasets for 3D pose estimation, Human3.6M and MPI-INF-3DHP. Additionally, we provide relevant insights about our decisions on the network architecture and show its flexibility to meet the best precision-speed compromise.

Submitted 3 September, 2020; originally announced September 2020.

Comments: Under review at PR

arXiv:2006.06611 [pdf, other]

Improving Deep Metric Learning with Virtual Classes and Examples Mining

Authors: Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein

Abstract: In deep metric learning, the training procedure relies on sampling informative tuples. However, as the training procedure progresses, it becomes nearly impossible to sample relevant hard negative examples without proper mining strategies or generation-based methods. Recent work on hard negative generation has shown great promise in solving the mining problem. However, this generation process is difficult to tune and often leads to incorrectly labelled examples. To tackle this issue, we introduce MIRAGE, a generation-based method that relies on virtual classes entirely composed of generated examples that act as buffer areas between the training classes. We empirically show that virtual classes significantly improve the results on popular datasets (Cub-200-2011, Cars-196 and Stanford Online Products) compared to other generation methods.

Submitted 11 June, 2020; originally announced June 2020.

arXiv:2005.12548 [pdf, other]

doi 10.1109/TIP.2019.2963378

Deepzzle: Solving Visual Jigsaw Puzzles with Deep Learning and Shortest Path Optimization

Authors: Marie-Morgane Paumard, David Picard, Hedi Tabia

Abstract: We tackle the image reassembly problem with wide space between the fragments, in such a way that the continuity of patterns and colors is mostly unusable. The spacing emulates the erosion from which archaeological fragments suffer. We crop-square the fragment borders to compel our algorithm to learn from the content of the fragments. We also complicate the image reassembly by removing fragments and adding pieces from other sources. We use a two-step method to obtain the reassemblies: 1) a neural network predicts the positions of the fragments despite the gaps between them; 2) a graph that leads to the best reassemblies is made from these predictions. In this paper, we notably investigate the effect of branch-cut in the graph of reassemblies. We also provide a comparison with the literature, solve complex image reassemblies, explore the dataset at length, and propose a new metric that suits its specificities. Keywords: image reassembly, jigsaw puzzle, deep learning, graph, branch-cut, cultural heritage

Submitted 26 May, 2020; originally announced May 2020.

Journal ref: IEEE Transactions on Image Processing (2020)

arXiv:2004.14644 [pdf, other]

doi 10.1016/j.patrec.2020.03.020

DIABLO: Dictionary-based Attention Block for Deep Metric Learning

Authors: Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein

Abstract: Recent breakthroughs in representation learning of unseen classes and examples have been made in deep metric learning by training at the same time the image representations and a corresponding metric with deep networks. Recent contributions mostly address the training part (loss functions, sampling strategies, etc.), while a few works focus on improving the discriminative power of the image representation. In this paper, we propose DIABLO, a dictionary-based attention method for image embedding. DIABLO produces richer representations by aggregating only visually-related features together while being easier to train than other attention-based methods in deep metric learning. This is experimentally confirmed on four deep metric learning datasets (Cub-200-2011, Cars-196, Stanford Online Products, and In-Shop Clothes Retrieval) for which DIABLO shows state-of-the-art performances.

Submitted 30 April, 2020; originally announced April 2020.

Comments: Pre-print. Accepted for publication at Pattern Recognition Letters

arXiv:2002.02250 [pdf, other]

Uncovering differential equations from data with hidden variables

Authors: Agustín Somacal, Yamila Barrera, Leonardo Boechi, Matthieu Jonckheere, Vincent Lefieux, Dominique Picard, Ezequiel Smucler

Abstract: SINDy is a method for learning systems of differential equations from data by solving a sparse linear regression optimization problem [Brunton et al., 2016]. In this article, we propose an extension of the SINDy method that learns systems of differential equations in cases where some of the variables are not observed. Our extension is based on regressing a higher order time derivative of a target variable onto a dictionary of functions that includes lower order time derivatives of the target variable. We evaluate our method by measuring the prediction accuracy of the learned dynamical systems on synthetic data and on a real data-set of temperature time series provided by the Réseau de Transport d'Électricité (RTE). Our method provides high quality short-term forecasts and it is orders of magnitude faster than competing methods for learning differential equations with latent variables.

Submitted 23 December, 2020; v1 submitted 6 February, 2020; originally announced February 2020.
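The idea of regressing a higher-order derivative onto a dictionary that contains lower-order derivatives can be shown on a toy problem. The sketch below is not the authors' implementation. It assumes a harmonic oscillator (x' = v, v' = -x) in which only x is observed, uses analytic derivatives where a real pipeline would estimate them numerically, and swaps in a simple sequentially thresholded least-squares loop as the sparse solver. The names `least_squares` and `sparse_regression`, the threshold, and the dictionary terms are all illustrative choices.

```python
import math


def least_squares(columns, target):
    """Solve min ||sum_j c_j * columns[j] - target||^2 via the normal equations."""
    k = len(columns)
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           + [sum(a * y for a, y in zip(columns[i], target))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * p for x, p in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def sparse_regression(library, target, threshold=0.1, iterations=5):
    """Sequentially thresholded least squares over a dict of named columns."""
    active = list(library)
    for _ in range(iterations):
        solution = least_squares([library[n] for n in active], target)
        kept = [n for n, c in zip(active, solution) if abs(c) >= threshold]
        if kept == active:
            break
        active = kept  # drop small terms and refit on the remaining ones
    final = dict(zip(active, least_squares([library[n] for n in active], target)))
    return {name: final.get(name, 0.0) for name in library}


# Observed variable x(t) = cos(t); the velocity is treated as hidden.
t = [0.05 * i for i in range(400)]
x = [math.cos(s) for s in t]
dx = [-math.sin(s) for s in t]    # first derivative of the observed variable
ddx = [-math.cos(s) for s in t]   # second derivative: the regression target

library = {
    "1": [1.0] * len(t),
    "x": x,
    "dx": dx,
    "x*dx": [a * b for a, b in zip(x, dx)],
}
model = sparse_regression(library, ddx)
print(model)
```

Because the second derivative is expressed only through x and its first derivative, the hidden velocity never has to be measured; the regression is expected to recover an equation of the form x'' = -x from the observed variable alone.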

arXiv:1912.08077 [pdf, other]

doi 10.1109/TPAMI.2020.2976014

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Authors: Diogo C Luvizon, Hedi Tabia, David Picard

Abstract: Human pose estimation and action recognition are related tasks since both problems are strongly dependent on the human body representation and analysis. Nonetheless, most recent methods in the literature handle the two problems separately. In this work, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can be used to solve both problems in an efficient way and still achieve state-of-the-art or comparable results at each task while running at more than 100 frames per second. The proposed method benefits from high parameter sharing between the two tasks by unifying the processing of still images and video clips in a single pipeline, allowing the model to be trained with data from different categories simultaneously and in a seamless way. Additionally, we provide important insights for end-to-end training of the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar.

Submitted 3 March, 2020; v1 submitted 14 December, 2019; originally announced December 2019.

Comments: Accepted to TPAMI. arXiv admin note: text overlap with arXiv:1802.09232

arXiv:1911.09245 [pdf, other]

Consensus-based Optimization for 3D Human Pose Estimation in Camera Coordinates

Authors: Diogo C Luvizon, Hedi Tabia, David Picard

Abstract: 3D human pose estimation is frequently seen as the task of estimating 3D poses relative to the root body joint. Alternatively, we propose a 3D human pose estimation method in camera coordinates, which allows effective combination of 2D annotated data and 3D poses and a straightforward multi-view generalization. To that end, we cast the problem as a view frustum space pose estimation, where absolute depth prediction and joint relative depth estimations are disentangled. Final 3D predictions are obtained in camera coordinates by the inverse camera projection. Based on this, we also present a consensus-based optimization algorithm for multi-view predictions from uncalibrated images, which requires a single monocular training procedure. Although our method is indirectly tied to the training camera intrinsics, it still converges for cameras with different intrinsic parameters, resulting in coherent estimations up to a scale factor. Our method improves the state of the art on well-known 3D human pose datasets, reducing the prediction error by 32% in the most common benchmark. We also report our results in absolute pose position error, achieving 80 mm for monocular estimations and 51 mm for multi-view, on average.

Submitted 20 August, 2021; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: Source code is available at https://github.com/dluvizon/3d-pose-consensus

arXiv:1908.02735 [pdf, other]

Metric Learning With HORDE: High-Order Regularizer for Deep Embeddings

Authors: Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein

Abstract: Learning an effective similarity measure between image representations is key to the success of recent advances in visual search tasks (e.g. verification or zero-shot learning). Although the metric learning part is well addressed, this metric is usually computed over the average of the extracted deep features. This representation is then trained to be discriminative. However, these deep features tend to be scattered across the feature space. Consequently, the representations are not robust to outliers, object occlusions, background variations, etc. In this paper, we tackle this scattering problem with a distribution-aware regularization named HORDE. This regularizer forces visually close images to have deep features with the same distribution, which are well localized in the feature space. We provide a theoretical analysis supporting this regularization effect. We also show the effectiveness of our approach by obtaining state-of-the-art results on 4 well-known datasets (Cub-200-2011, Cars-196, Stanford Online Products and Inshop Clothes Retrieval).

Submitted 7 August, 2019; originally announced August 2019.

Comments: Camera-ready for our ICCV 2019 paper (poster)

arXiv:1906.01972 [pdf, ps, other]

Efficient Codebook and Factorization for Second Order Representation Learning

Authors: Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein

Abstract: Learning rich and compact representations is an open topic in many fields such as object recognition or image retrieval. Deep neural networks have made a major breakthrough during the last few years for these tasks but their representations are not necessarily as rich as needed nor as compact as expected. To build richer representations, high order statistics have been exploited and have shown excellent performance, but they produce higher dimensional features. While this drawback has been partially addressed with factorization schemes, the original compactness of first order models has never been recovered, or only at the cost of a strong performance decrease. Our method, by jointly integrating a codebook strategy into a factorization scheme, is able to produce compact representations while keeping the second order performance with few additional parameters. This formulation leads to state-of-the-art results on three image retrieval datasets.

Submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted at IEEE International Conference on Image Processing (ICIP) 2019

arXiv:1809.00898 [pdf, other]

Image Reassembly Combining Deep Learning and Shortest Path Problem

Authors: M. -M. Paumard, D. Picard, H. Tabia

Abstract: This paper addresses the problem of reassembling images from disjointed fragments. More specifically, given an unordered set of fragments, we aim at reassembling one or several possibly incomplete images. The main contributions of this work are: 1) several deep neural architectures to predict the relative position of image fragments that outperform the previous state of the art; 2) casting the reassembly problem into the shortest path in a graph problem for which we provide several construction algorithms depending on available information; 3) a new dataset of images taken from the Metropolitan Museum of Art (MET) dedicated to image reassembly for which we provide a clear setup and a strong baseline.

Submitted 4 September, 2018; originally announced September 2018.

Comments: ECCV 2018

arXiv:1807.03155 [pdf, other]

Jigsaw Puzzle Solving Using Local Feature Co-Occurrences in Deep Neural Networks

Authors: Marie-Morgane Paumard, David Picard, Hedi Tabia

Abstract: Archaeologists are in dire need of automated object reconstruction methods. Fragment reassembly is close to puzzle problems, which may be solved by computer vision algorithms. As they are often beaten on most image related tasks by deep learning algorithms, we study a classification method that can solve jigsaw puzzles. In this paper, we focus on classifying the relative position: given a pair of fragments, we compute their local relation (e.g. on top). We propose several enhancements over the state of the art in this domain, which is outperformed by our method by 25%. We propose an original dataset composed of pictures from the Metropolitan Museum of Art. We propose a greedy reconstruction method based on the predicted relative positions.

Submitted 5 July, 2018; originally announced July 2018.

Comments: ICIP 2018

arXiv:1806.08991 [pdf, other]

Leveraging Implicit Spatial Information in Global Features for Image Retrieval

Authors: Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein

Abstract: Most image retrieval methods use global features that aggregate local distinctive patterns into a single representation. However, the aggregation process destroys the relative spatial information by considering orderless sets of local descriptors. We propose to integrate relative spatial information into the aggregation process by taking into account co-occurrences of local patterns in a tensor framework. The resulting signature, called Improved Spatial Tensor Aggregation (ISTA), is able to reach state-of-the-art performance on well-known datasets such as Holidays, Oxford5k and Paris6k.

Submitted 23 June, 2018; originally announced June 2018.

Comments: 8 pages, 2 figures and 1 table. Draft paper for conference, IEEE International Conference on Image Processing (ICIP) 2018

arXiv:1805.00900 [pdf, other]

doi 10.1109/ICDEW.2018.00035

Images & Recipes: Retrieval in the cooking context

Authors: Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Matthieu Cord

Abstract: Recent advances in the machine learning community have allowed different use cases to emerge, such as its association with domains like cooking, which gave rise to computational cuisine. In this paper, we tackle the picture-recipe alignment problem, having as target application the large-scale retrieval task (finding a recipe given a picture, and vice versa). Our approach is validated on the Recipe1M dataset, composed of one million image-recipe pairs and additional class information, for which we achieve state-of-the-art results.

Submitted 2 May, 2018; originally announced May 2018.

Comments: Published at DECOR / ICDE 2018. Extended version accepted at SIGIR 2018, available here: arXiv:1804.11146

arXiv:1804.11146 [pdf, other]

Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings

Authors: Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord

Abstract: Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that are capable of analyzing them. In this paper, we propose a cross-modal retrieval model aligning visual and textual data (like pictures of dishes and their recipes) in a shared representation space. We describe an effective learning scheme, capable of tackling large-scale problems, and validate it on the Recipe1M dataset containing nearly 1 million picture-recipe pairs. We show the effectiveness of our approach relative to previous state-of-the-art models and present qualitative results over computational cooking use cases.

Submitted 30 April, 2018; originally announced April 2018.

Comments: accepted at the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018

arXiv:1804.01852

GoSGD: Distributed Optimization for Deep Learning with Gossip Exchange

Authors: Michael Blot, David Picard, Matthieu Cord

Abstract: We address the issue of speeding up the training of convolutional neural networks by studying a distributed method adapted to stochastic gradient descent. Our parallel optimization setup uses several threads, each applying individual gradient descents on a local variable. We propose a new way of sharing information between different threads based on gossip algorithms that show good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized.

Submitted 12 November, 2018; v1 submitted 4 April, 2018; originally announced April 2018.

Comments: Correction to do, and difficulties to change the document

arXiv:1802.09232 [pdf, other]

2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning

Authors: Diogo C. Luvizon, David Picard, Hedi Tabia

Abstract: Action recognition and human pose estimation are closely related but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separated learning. The proposed architecture can be trained with data from different categories simultaneously in a seamless way. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.

Submitted 21 March, 2018; v1 submitted 26 February, 2018; originally announced February 2018.

Comments: To appear in CVPR 2018

arXiv:1710.02322 [pdf, other]

Human Pose Regression by Combining Indirect Part Detection and Contextual Information

Authors: Diogo C. Luvizon, Hedi Tabia, David Picard

Abstract: In this paper, we propose an end-to-end trainable regression approach for human pose estimation from still images. We use the proposed Soft-argmax function to convert feature maps directly to joint coordinates, resulting in a fully differentiable framework. Our method is able to learn heat map representations indirectly, without additional steps of artificial ground truth generation. Consequently, contextual information can be included in the pose predictions in a seamless way. We evaluated our method on two very challenging datasets, the Leeds Sports Poses (LSP) and the MPII Human Pose datasets, reaching the best performance among all the existing regression methods and comparable results to the state-of-the-art detection based approaches.

Submitted 6 October, 2017; originally announced October 2017.
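
The Soft-argmax operation named in the abstract above can be illustrated with a short sketch. This is a generic soft-argmax (the softmax-weighted expectation of pixel coordinates), not necessarily the exact formulation used in the paper; the function name `soft_argmax_2d` and the sharpness parameter `beta` are assumptions introduced here for illustration.

```python
import math


def soft_argmax_2d(heatmap, beta=1.0):
    """Return the expected (x, y) position under a softmax of the heatmap.

    heatmap: list of rows (H x W) of raw scores.
    beta: hypothetical sharpness factor (an assumption, not from the paper);
          larger values move the result toward the hard argmax.
    Coordinates are pixel indices: x is the column, y is the row.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)

    # Expectation of the column index and of the row index.
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y
```

Because the output is a weighted average rather than an index lookup, it varies smoothly with the scores, which is what makes a coordinate regression of this kind differentiable end to end.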

arXiv:1701.00167 [pdf, ps, other]

Very Fast Kernel SVM under Budget Constraints

Authors: David Picard

Abstract: In this paper we propose a fast online Kernel SVM algorithm under tight budget constraints. We propose to split the input space using LVQ and train a Kernel SVM in each cluster. To allow for online training, we propose to limit the size of the support vector set of each cluster using different strategies. We show in the experiment that our algorithm is able to achieve high accuracy while having a very high number of samples processed per second both in training and in the evaluation.

Submitted 31 December, 2016; originally announced January 2017.

arXiv:1611.09726 [pdf, other]

Gossip training for deep learning

Authors: Michael Blot, David Picard, Matthieu Cord, Nicolas Thome

Abstract: We address the issue of speeding up the training of convolutional networks. Here we study a distributed method adapted to stochastic gradient descent (SGD). The parallel optimization setup uses several threads, each applying individual gradient descents on a local variable. We propose a new way to share information between different threads inspired by gossip algorithms and showing good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized. We compared our method to the recent EASGD on CIFAR-10 and show encouraging results.

Submitted 29 November, 2016; originally announced November 2016.

arXiv:1207.4776 [pdf]

doi 10.1007/978-3-642-31534-3_80

Design and User Satisfaction of Interactive Maps for Visually Impaired People

Authors: Anke Brock, Philippe Truillet, Bernard Oriola, Delphine Picard, Christophe Jouffrais

Abstract: Multimodal interactive maps are a solution for presenting spatial information to visually impaired people. In this paper, we present an interactive multimodal map prototype that is based on a tactile paper map, a multi-touch screen and audio output. We first describe the different steps for designing an interactive map: drawing and printing the tactile paper map, choice of multi-touch technology, interaction technologies and the software architecture. Then we describe the method used to assess user satisfaction. We provide data showing that an interactive map - although based on a unique, elementary, double tap interaction - has been met with a high level of user satisfaction. Interestingly, satisfaction is independent of a user's age, previous visual experience or Braille experience. This prototype will be used as a platform to design advanced interactions for spatial learning.

Submitted 19 July, 2012; originally announced July 2012.

Journal ref: ICCHP 2012 (2012) 544-551

Showing 1–48 of 48 results for author: Picard, D