Search | arXiv e-print repository

From Image to Video: An Empirical Study of Diffusion Representations

Authors: Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, Mehdi S. M. Sajjadi

Abstract: Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap… ▽ More Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning. △ Less

Submitted 19 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

arXiv:2412.15212 [pdf, ps, other]

Scaling 4D Representations

Authors: João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker , et al. (10 additional authors not shown)

Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks $\unicode{x2013}$ action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose… ▽ More Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks $\unicode{x2013}$ action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model $\unicode{x2013}$ 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations. Pretrained models are available at https://github.com/google-deepmind/representations4d . △ Less

Submitted 9 July, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

arXiv:2306.00763 [pdf, other]

Learning Disentangled Prompts for Compositional Image Synthesis

Authors: Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa

Abstract: We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand the compositional image synthesis. We present a framework that leverages a pretrained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled… ▽ More We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand the compositional image synthesis. We present a framework that leverages a pretrained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled visual prompt that learns disentangled prompts of semantic (e.g., class) and domain (e.g., style) from a few images. Learned domain prompt is then used to synthesize images of any classes in the style of target domain. We conduct studies on various target domains with the number of images ranging from one to a few to many, and show qualitative results which show the compositional generalization of our method. Moreover, we show that our method can help improve zero-shot domain adaptation classification accuracy. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: tech report

arXiv:2210.00990 [pdf, other]

Visual Prompt Tuning for Generative Transfer Learning

Authors: Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

Abstract: Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an… ▽ More Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompt to the image token sequence, and introduce a new prompt design for our task. We study on a variety of visual domains, including visual task adaptation benchmark~\cite{zhai2019large}, with varying amount of training images, and show effectiveness of knowledge transfer and a significantly better image generation quality over existing works. △ Less

Submitted 3 October, 2022; originally announced October 2022.

Comments: technical report

arXiv:2004.07268 [pdf, other]

Learning Furniture Compatibility with Graph Neural Networks

Authors: Luisa F. Polania, Mauricio Flores, Yiran Li, Matthew Nokleby

Abstract: We propose a graph neural network (GNN) approach to the problem of predicting the stylistic compatibility of a set of furniture items from images. While most existing results are based on siamese networks which evaluate pairwise compatibility between items, the proposed GNN architecture exploits relational information among groups of items. We present two GNN models, both of which comprise a deep… ▽ More We propose a graph neural network (GNN) approach to the problem of predicting the stylistic compatibility of a set of furniture items from images. While most existing results are based on siamese networks which evaluate pairwise compatibility between items, the proposed GNN architecture exploits relational information among groups of items. We present two GNN models, both of which comprise a deep CNN that extracts a feature representation for each image, a gated recurrent unit (GRU) network that models interactions between the furniture items in a set, and an aggregation function that calculates the compatibility score. In the first model, a generalized contrastive loss function that promotes the generation of clustered embeddings for items belonging to the same furniture set is introduced. Also, in the first model, the edge function between nodes in the GRU and the aggregation function are fixed in order to limit model complexity and allow training on smaller datasets; in the second model, the edge function and aggregation function are learned directly from the data. We demonstrate state-of-the art accuracy for compatibility prediction and "fill in the blank" tasks on the Bonn and Singapore furniture datasets. We further introduce a new dataset, called the Target Furniture Collections dataset, which contains over 6000 furniture items that have been hand-curated by stylists to make up 1632 compatible sets. We also demonstrate superior prediction accuracy on this dataset. △ Less

Submitted 15 April, 2020; originally announced April 2020.

Comments: Accepted for publication at CVPR Workshops

arXiv:2002.09023 [pdf, other]

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

Authors: Xin Guo, Luisa F. Polanía, Kenneth E. Barner

Abstract: This paper presents an audiovisual-based emotion recognition hybrid network. While most of the previous work focuses either on using deep models or hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neutral networks (RNN) trained on facial images, the… ▽ More This paper presents an audiovisual-based emotion recognition hybrid network. While most of the previous work focuses either on using deep models or hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neutral networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps, which are built based on short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin. △ Less

Submitted 20 February, 2020; originally announced February 2020.

arXiv:1912.05035 [pdf, other]

Deep Adaptive Wavelet Network

Authors: Maria Ximena Bastidas Rodriguez, Adrien Gruson, Luisa F. Polania, Shin Fujieda, Flavio Prieto Ortiz, Kohei Takayama, Toshiya Hachisuka

Abstract: Even though convolutional neural networks have become the method of choice in many fields of computer vision, they still lack interpretability and are usually designed manually in a cumbersome trial-and-error process. This paper aims at overcoming those limitations by proposing a deep neural network, which is designed in a systematic fashion and is interpretable, by integrating multiresolution ana… ▽ More Even though convolutional neural networks have become the method of choice in many fields of computer vision, they still lack interpretability and are usually designed manually in a cumbersome trial-and-error process. This paper aims at overcoming those limitations by proposing a deep neural network, which is designed in a systematic fashion and is interpretable, by integrating multiresolution analysis at the core of the deep neural network design. By using the lifting scheme, it is possible to generate a wavelet representation and design a network capable of learning wavelet coefficients in an end-to-end form. Compared to state-of-the-art architectures, the proposed model requires less hyper-parameter tuning and achieves competitive accuracy in image classification tasks △ Less

Submitted 10 December, 2019; originally announced December 2019.

arXiv:1909.12911 [pdf, other]

Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases

Authors: Xin Guo, Luisa F. Polania, Bin Zhu, Charles Boncelet, Kenneth E. Barner

Abstract: A graph neural network (GNN) for image understanding based on multiple cues is proposed in this paper. Compared to traditional feature and decision fusion approaches that neglect the fact that features can interact and exchange information, the proposed GNN is able to pass information among features extracted from different models. Two image understanding tasks, namely group-level emotion recognit… ▽ More A graph neural network (GNN) for image understanding based on multiple cues is proposed in this paper. Compared to traditional feature and decision fusion approaches that neglect the fact that features can interact and exchange information, the proposed GNN is able to pass information among features extracted from different models. Two image understanding tasks, namely group-level emotion recognition (GER) and event recognition, which are highly semantic and require the interaction of several deep models to synthesize multiple cues, were selected to validate the performance of the proposed method. It is shown through experiments that the proposed method achieves state-of-the-art performance on the selected image understanding tasks. In addition, a new group-level emotion recognition database is introduced and shared in this paper. △ Less

Submitted 28 February, 2020; v1 submitted 18 September, 2019; originally announced September 2019.

Comments: Paper accepted for publication at the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)

arXiv:1905.03703 [pdf, other]

Learning fashion compatibility across apparel categories for outfit recommendation

Authors: Luisa F. Polania, Satyajit Gupte

Abstract: This paper addresses the problem of generating recommendations for completing the outfit given that a user is interested in a particular apparel item. The proposed method is based on a siamese network used for feature extraction followed by a fully-connected network used for learning a fashion compatibility metric. The embeddings generated by the siamese network are augmented with color histogram… ▽ More This paper addresses the problem of generating recommendations for completing the outfit given that a user is interested in a particular apparel item. The proposed method is based on a siamese network used for feature extraction followed by a fully-connected network used for learning a fashion compatibility metric. The embeddings generated by the siamese network are augmented with color histogram features motivated by the important role that color plays in determining fashion compatibility. The training of the network is formulated as a maximum a posteriori (MAP) problem where Laplacian distributions are assumed for the filters of the siamese network to promote sparsity and matrix-variate normal distributions are assumed for the weights of the metric network to efficiently exploit correlations between the input units of each fully-connected layer. △ Less

Submitted 1 May, 2019; originally announced May 2019.

Comments: Accepted for publication at ICIP 2019

arXiv:1811.03268 [pdf, other]

Ordinal Regression using Noisy Pairwise Comparisons for Body Mass Index Range Estimation

Authors: Luisa Polania, Dongning Wang, Glenn Fung

Abstract: Ordinal regression aims to classify instances into ordinal categories. In this paper, body mass index (BMI) category estimation from facial images is cast as an ordinal regression problem. In particular, noisy binary search algorithms based on pairwise comparisons are employed to exploit the ordinal relationship among BMI categories. Comparisons are performed with Siamese architectures, one of whi… ▽ More Ordinal regression aims to classify instances into ordinal categories. In this paper, body mass index (BMI) category estimation from facial images is cast as an ordinal regression problem. In particular, noisy binary search algorithms based on pairwise comparisons are employed to exploit the ordinal relationship among BMI categories. Comparisons are performed with Siamese architectures, one of which uses the Bradley-Terry model probabilities as target. The Bradley-Terry model is an approach to describe probabilities of the possible outcomes when elements of a set are repeatedly compared with one another in pairs. Experimental results show that our approach outperforms classification and regression-based methods at estimating BMI categories. △ Less

Submitted 7 November, 2018; originally announced November 2018.

Comments: Paper accepted for publication at the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019)

arXiv:1806.01779 [pdf, ps, other]

Compressed Sensing ECG using Restricted Boltzmann Machines

Authors: Luisa Polania, Rafael Plaza

Abstract: Recently, it has been shown that compressed sensing (CS) has the potential to lower energy consumption in wireless electrocardiogram (ECG) systems. By reducing the number of acquired measurements, the communication burden is decreased and energy is saved. In this paper, we aim at further reducing the number of necessary measurements to achieve faithful reconstruction by exploiting the representati… ▽ More Recently, it has been shown that compressed sensing (CS) has the potential to lower energy consumption in wireless electrocardiogram (ECG) systems. By reducing the number of acquired measurements, the communication burden is decreased and energy is saved. In this paper, we aim at further reducing the number of necessary measurements to achieve faithful reconstruction by exploiting the representational power of restricted Boltzmann machines (RBMs) to model the probability distribution of the sparsity pattern of ECG signals. The motivation for using this approach is to capture the higher-order statistical dependencies between the coefficients of the ECG sparse representation, which in turn, leads to superior reconstruction accuracy and reduction in the number of measurements, as it is shown via experiments. △ Less

Submitted 30 May, 2018; originally announced June 2018.

Comments: Accepted for publication at Biomedical Signal Processing and Control (Elsevier)

arXiv:1802.02185 [pdf, other]

Smile detection in the wild based on transfer learning

Authors: Xin Guo, Luisa F. Polanía, Kenneth E. Barner

Abstract: Smile detection from unconstrained facial images is a specialized and challenging problem. As one of the most informative expressions, smiles convey basic underlying emotions, such as happiness and satisfaction, which lead to multiple applications, e.g., human behavior analysis and interactive controlling. Compared to the size of databases for face recognition, far less labeled data is available f… ▽ More Smile detection from unconstrained facial images is a specialized and challenging problem. As one of the most informative expressions, smiles convey basic underlying emotions, such as happiness and satisfaction, which lead to multiple applications, e.g., human behavior analysis and interactive controlling. Compared to the size of databases for face recognition, far less labeled data is available for training smile detection systems. To leverage the large amount of labeled data from face recognition datasets and to alleviate overfitting on smile detection, an efficient transfer learning-based smile detection approach is proposed in this paper. Unlike previous works which use either hand-engineered features or train deep convolutional networks from scratch, a well-trained deep face recognition model is explored and fine-tuned for smile detection in the wild. Three different models are built as a result of fine-tuning the face recognition model with different inputs, including aligned, unaligned and grayscale images generated from the GENKI-4K dataset. Experiments show that the proposed approach achieves improved state-of-the-art performance. Robustness of the model to noise and blur artifacts is also evaluated in this paper. △ Less

Submitted 17 January, 2018; originally announced February 2018.

arXiv:1705.10500 [pdf, ps, other]

doi 10.1109/TSP.2017.2712128

Exploiting Restricted Boltzmann Machines and Deep Belief Networks in Compressed Sensing

Authors: Luisa F. Polania, Kenneth E. Barner

Abstract: This paper proposes a CS scheme that exploits the representational power of restricted Boltzmann machines and deep learning architectures to model the prior distribution of the sparsity pattern of signals belonging to the same class. The determined probability distribution is then used in a maximum a posteriori (MAP) approach for the reconstruction. The parameters of the prior distribution are lea… ▽ More This paper proposes a CS scheme that exploits the representational power of restricted Boltzmann machines and deep learning architectures to model the prior distribution of the sparsity pattern of signals belonging to the same class. The determined probability distribution is then used in a maximum a posteriori (MAP) approach for the reconstruction. The parameters of the prior distribution are learned from training data. The motivation behind this approach is to model the higher-order statistical dependencies between the coefficients of the sparse representation, with the final goal of improving the reconstruction. The performance of the proposed method is validated on the Berkeley Segmentation Dataset and the MNIST Database of handwritten digits. △ Less

Submitted 30 May, 2017; originally announced May 2017.

Comments: Accepted for publication at IEEE Transactions on Signal Processing

arXiv:1405.4201 [pdf, ps, other]

doi 10.1109/JBHI.2014.2325017

Exploiting Prior Knowledge in Compressed Sensing Wireless ECG Systems

Authors: Luisa F. Polania, Rafael E. Carrillo, Manuel Blanco-Velasco, Kenneth E. Barner

Abstract: Recent results in telecardiology show that compressed sensing (CS) is a promising tool to lower energy consumption in wireless body area networks for electrocardiogram (ECG) monitoring. However, the performance of current CS-based algorithms, in terms of compression rate and reconstruction quality of the ECG, still falls short of the performance attained by state-of-the-art wavelet based algorithm… ▽ More Recent results in telecardiology show that compressed sensing (CS) is a promising tool to lower energy consumption in wireless body area networks for electrocardiogram (ECG) monitoring. However, the performance of current CS-based algorithms, in terms of compression rate and reconstruction quality of the ECG, still falls short of the performance attained by state-of-the-art wavelet based algorithms. In this paper, we propose to exploit the structure of the wavelet representation of the ECG signal to boost the performance of CS-based methods for compression and reconstruction of ECG signals. More precisely, we incorporate prior information about the wavelet dependencies across scales into the reconstruction algorithms and exploit the high fraction of common support of the wavelet coefficients of consecutive ECG segments. Experimental results utilizing the MIT-BIH Arrhythmia Database show that significant performance gains, in terms of compression rate and reconstruction quality, can be obtained by the proposed algorithms compared to current CS-based methods. △ Less

Submitted 29 May, 2014; v1 submitted 16 May, 2014; originally announced May 2014.

Comments: Accepted for publication at IEEE Journal of Biomedical and Health Informatics

Showing 1–14 of 14 results for author: Polanía, L