-
An Achievability Bound for Type-Based Unsourced Multiple Access
Authors:
Deekshith Pathayappilly Krishnan,
Kaan Okumus,
Khac-Hoang Ngo,
Giuseppe Durisi
Abstract:
We derive an achievability bound to quantify the performance of a type-based unsourced multiple access system -- an information-theoretic model for grant-free multiple access with correlated messages. The bound extends available achievability results for the per-user error probability in the unsourced multiple access framework, where, different from our setup, message collisions are treated as errors. Specifically, we provide an upper bound on the total variation distance between the type (i.e., the empirical probability mass function) of the transmitted messages and its estimate over a Gaussian multiple access channel. Through numerical simulations, we illustrate that our bound can be used to identify the message type that is less efficient to transmit because it is more difficult to detect. We finally show that a practical scheme for type estimation, based on coded compressed sensing with approximate message passing, operates approximately 3 dB away from the bound, for the parameters considered in the paper.
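To illustrate the quantity being bounded, the following minimal sketch computes the type of a list of transmitted messages and its total variation distance to an estimate. The function names, the indexing of messages from 0 to `num_messages - 1`, and the example values are assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def message_type(messages, num_messages):
    """Empirical probability mass function (type) of a list of message indices."""
    if not messages:
        raise ValueError("at least one message is required")
    counts = Counter(messages)
    total = len(messages)
    # Fraction of users that sent each message index 0..num_messages-1.
    return [counts.get(m, 0) / total for m in range(num_messages)]


def total_variation(p, q):
    """Total variation distance between two pmfs on the same alphabet."""
    if len(p) != len(q):
        raise ValueError("pmfs must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical example: four users, alphabet of four messages.
true_type = message_type([0, 0, 1, 2], num_messages=4)
estimated_type = [0.25, 0.25, 0.25, 0.25]  # assumed receiver estimate
tv = total_variation(true_type, estimated_type)
```

Two users sending the same message simply raise that message's mass in the type, which is how the type-based setting avoids counting collisions as errors.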
Submitted 28 April, 2025;
originally announced April 2025.
-
JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data
Authors:
Apurva Kalia,
Yan Zhou Chen,
Dilip Krishnan,
Soha Hassoun
Abstract:
Motivation: A major challenge in metabolomics is annotation: assigning molecular structures to mass spectral fragmentation patterns. Despite recent advances in molecule-to-spectra and in spectra-to-molecular fingerprint prediction (FP), annotation rates remain low. Results: We introduce in this paper a novel paradigm (JESTR) for annotation. Unlike prior approaches that explicitly construct molecular fingerprints or spectra, JESTR leverages the insight that molecules and their corresponding spectra are views of the same data and effectively embeds their representations in a joint space. Candidate structures are ranked based on cosine similarity between the embeddings of query spectrum and each candidate. We evaluate JESTR against mol-to-spec and spec-to-FP annotation tools on three datasets. On average, for rank@[1-5], JESTR outperforms other tools by 23.6%-71.6%. We further demonstrate the strong value of regularization with candidate molecules during training, boosting rank@1 performance by 11.4% and enhancing the model's ability to discern between target and candidate molecules. When comparing JESTR's performance against that of publicly available pretrained models of SIRIUS and CFM-ID on appropriate subsets of MassSpecGym benchmark dataset, JESTR outperforms these tools by 31% and 238%, respectively. Through JESTR, we offer a novel promising avenue towards accurate annotation, therefore unlocking valuable insights into the metabolome.
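The ranking step described above can be sketched as follows, assuming the spectrum and candidate embeddings have already been produced by some encoders. The names `cosine_similarity` and `rank_candidates`, and the toy two-dimensional vectors, are hypothetical and are not taken from the JESTR code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_candidates(spectrum_embedding, candidate_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(spectrum_embedding, emb))
        for name, emb in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for learned joint-space representations.
query = [1.0, 0.0]
candidates = {
    "candidate_a": [0.0, 1.0],
    "candidate_b": [2.0, 0.0],
    "candidate_c": [1.0, 1.0],
}
ranking = rank_candidates(query, candidates)
```

Because cosine similarity ignores vector length, `candidate_b` ranks first here even though its norm differs from that of the query.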
Submitted 7 June, 2025; v1 submitted 17 November, 2024;
originally announced November 2024.
-
Collisional Dynamics of Solitons and Pattern Formation in an Integrable Cross Coupled Nonlinear Schrodinger equation with constant background
Authors:
P. S. Vinayagam,
D. Aravindha Krishnan,
R. V. Kamaleshwaran,
R. Radha
Abstract:
We investigate the dynamics arising from the propagation of light pulses with different polarizations through a condensate (referred to as a constant background field) with cross coupling, described by a coupled nonlinear Schrodinger equation (NLSE) type equation. We then employ the gauge and Darboux transformation approach to bring out the rich dynamics arising from the background field and cross coupling. The collisional dynamics of bright solitons is found to be inelastic. The constant background field is found to facilitate the periodic localization of light pulses during propagation. We have also unearthed breathers and bright-bright, bright-dark and dark-bright solitons of the coupled NLSE. While the amplitude of the breathers oscillates with time as predicted, their maximum (or minimum) amplitude is found to remain constant, and the addition of cross coupling only contributes to rapid fluctuations in their amplitude over a period of time. In addition, the reinforcement of cross coupling in the presence of the constant wave field facilitates the interference of light pulses, leading to interesting pattern formation among bright-bright, bright-dark and dark-bright solitons. The highlight of the results is that one obtains various localized excitations, such as breathers and bright and dark solitons, by simply manipulating the amplitude of the constant wave field.
Submitted 28 September, 2024;
originally announced September 2024.
-
Type-Based Unsourced Multiple Access
Authors:
Khac-Hoang Ngo,
Deekshith Pathayappilly Krishnan,
Kaan Okumus,
Giuseppe Durisi,
Erik G. Ström
Abstract:
We generalize the type-based multiple access framework proposed by Mergen and Tong (2006) to the case of unsourced multiple access. In the proposed framework, each device tracks the state of a physical/digital process, quantizes this state, and communicates it to a common receiver through a shared channel in an uncoordinated manner. The receiver aims to estimate the type of the states, i.e., the set of states and their multiplicity in the sequence of states reported by all devices. We measure the type estimation error using the Wasserstein distance. Considering an example of multi-target position tracking, we show that type estimation can be performed effectively via approximate message passing. Furthermore, we determine the quantization resolution that minimizes the type estimation error by balancing quantization distortion and communication error.
Submitted 15 July, 2024; v1 submitted 30 April, 2024;
originally announced April 2024.
-
Denoising Vision Transformers
Authors:
Jiawei Yang,
Katie Z Luo,
Jiefeng Li,
Congyue Deng,
Leonidas Guibas,
Dilip Krishnan,
Kilian Q Weinberger,
Yonglong Tian,
Yue Wang
Abstract:
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
Submitted 22 July, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
Learning Vision from Models Rivals Learning Vision from Data
Authors:
Yonglong Tian,
Lijie Fan,
Kaifeng Chen,
Dina Katabi,
Dilip Krishnan,
Phillip Isola
Abstract:
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
Submitted 28 December, 2023;
originally announced December 2023.
-
Scaling Laws of Synthetic Images for Model Training ... for Now
Authors:
Lijie Fan,
Kaifeng Chen,
Dilip Krishnan,
Dina Katabi,
Phillip Isola,
Yonglong Tian
Abstract:
Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
Submitted 7 December, 2023;
originally announced December 2023.
-
Improve Supervised Representation Learning with Masked Image Modeling
Authors:
Kaifeng Chen,
Daniel Salz,
Huiwen Chang,
Kihyuk Sohn,
Dilip Krishnan,
Mojtaba Seyedhosseini
Abstract:
Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by recent success of adopting masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that can easily integrate MIM into existing supervised training paradigms. In our design, in addition to the original classification task applied to a vision transformer image encoder, we add a shallow transformer-based decoder on top of the encoder and introduce an MIM task which tries to reconstruct image tokens based on masked image inputs. We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations for downstream tasks such as classification, image retrieval, and semantic segmentation. We conduct a comprehensive study and evaluation of our setup on public benchmarks. On ImageNet-1k, our ViT-B/14 model achieves 81.72% validation accuracy, 2.01% higher than the baseline model. On K-Nearest-Neighbor image retrieval evaluation with ImageNet-1k, the same model outperforms the baseline by 1.32%. We also show that this setup can be easily scaled to larger models and datasets. Code and checkpoints will be released.
Submitted 1 December, 2023;
originally announced December 2023.
-
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
Authors:
Tianhong Li,
Sangnie Bhardwaj,
Yonglong Tian,
Han Zhang,
Jarred Barber,
Dina Katabi,
Guillaume Lajoie,
Huiwen Chang,
Dilip Krishnan
Abstract:
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce $\textbf{ITIT}$ ($\textbf{I}$n$\textbf{T}$egrating $\textbf{I}$mage $\textbf{T}$ext): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits similar scaling behavior as using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data.
Submitted 5 October, 2023;
originally announced October 2023.
-
Substance or Style: What Does Your Image Embedding Know?
Authors:
Cyrus Rashtchian,
Charles Herrmann,
Chun-Sung Ferng,
Ayan Chakrabarti,
Dilip Krishnan,
Deqing Sun,
Da-Cheng Juan,
Andrew Tomkins
Abstract:
Probes are small networks that predict properties of underlying data from embeddings, and they provide a targeted, effective way to illuminate the information contained in embeddings. While analysis through the use of probes has become standard in NLP, there has been much less exploration in vision. Image foundation models have primarily been evaluated for semantic content. Better understanding the non-semantic information in popular embeddings (e.g., MAE, SimCLR, or CLIP) will shed new light both on the training algorithms and on the uses for these foundation models. We design a systematic transformation prediction task and measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE). Overall, our results suggest that the choice of pre-training algorithm impacts the types of information in the embedding, and certain models are better than others for non-semantic downstream tasks.
Submitted 10 July, 2023;
originally announced July 2023.
-
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Authors:
Yonglong Tian,
Lijie Fan,
Phillip Isola,
Huiwen Chang,
Dilip Krishnan
Abstract:
We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
Submitted 26 October, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
StyleDrop: Text-to-Image Generation in Any Style
Authors:
Kihyuk Sohn,
Nataniel Ruiz,
Kimin Lee,
Daniel Castro Chin,
Irina Blok,
Huiwen Chang,
Jarred Barber,
Lu Jiang,
Glenn Entis,
Yuanzhen Li,
Yuan Hao,
Irfan Essa,
Michael Rubinstein,
Dilip Krishnan
Abstract:
Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than $1\%$ of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io
Submitted 1 June, 2023;
originally announced June 2023.
-
Improving CLIP Training with Language Rewrites
Authors:
Lijie Fan,
Dilip Krishnan,
Phillip Isola,
Dina Katabi,
Yonglong Tian
Abstract:
Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training. Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
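The text-augmentation step can be pictured with the minimal sketch below. It assumes that the rewritten captions were generated ahead of time and that the original and each rewrite are drawn with equal probability; the abstract does not state the sampling distribution, and `sample_caption` is a hypothetical helper, not a function from the LaCLIP repository.

```python
import random


def sample_caption(original, rewrites, rng=random):
    """Pick the original caption or one of its rewrites uniformly at random."""
    pool = [original] + list(rewrites)
    return rng.choice(pool)


# Hypothetical captions for one training image.
original = "a dog on a beach"
rewrites = [
    "a dog standing on the sand by the sea",
    "on the beach, a dog looks at the waves",
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
caption_for_this_step = sample_caption(original, rewrites, rng)
```

Since the rewrites are prepared before training, the per-step work is a single random draw, which is consistent with the claim of no computation or memory overhead during training.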
Submitted 28 October, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Steerable Equivariant Representation Learning
Authors:
Sangnie Bhardwaj,
Willie McClinton,
Tongzhou Wang,
Guillaume Lajoie,
Chen Sun,
Phillip Isola,
Dilip Krishnan
Abstract:
Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote invariance in the embedding space to the input image transformations. This invariance reduces generalization to those downstream tasks which rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead equivariant to data augmentations. We achieve this equivariance through the use of steerable representations. Our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that our resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g., we improve linear probe top-1 accuracy by 1% to 3% for transfer, and ImageNet-C accuracy by up to 3.4%. We further show that the steerability of our representations provides significant speedup (nearly 50x) for test-time augmentations; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.
Submitted 22 February, 2023;
originally announced February 2023.
-
Muse: Text-To-Image Generation via Masked Generative Transformers
Authors:
Huiwen Chang,
Han Zhang,
Jarred Barber,
AJ Maschinot,
Jose Lezama,
Lu Jiang,
Ming-Hsuan Yang,
Kevin Murphy,
William T. Freeman,
Michael Rubinstein,
Yuanzhen Li,
Dilip Krishnan
Abstract:
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
Submitted 2 January, 2023;
originally announced January 2023.
-
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
Authors:
Tianhong Li,
Huiwen Chang,
Shlok Kumar Mishra,
Han Zhang,
Dina Katabi,
Dilip Krishnan
Abstract:
Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
Submitted 29 June, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
-
A simple, efficient and scalable contrastive masked autoencoder for learning visual representations
Authors:
Shlok Mishra,
Joshua Robinson,
Huiwen Chang,
David Jacobs,
Aaron Sarna,
Aaron Maschinot,
Dilip Krishnan
Abstract:
We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, but is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L models.
Submitted 30 October, 2022;
originally announced October 2022.
-
The Lick Observatory Supernova Search follow-up program: photometry data release of 70 stripped-envelope supernovae
Authors:
WeiKang Zheng,
Benjamin E. Stahl,
Thomas de Jaeger,
Alexei V. Filippenko,
Shan-Qin Wang,
Wen-Pei Gan,
Thomas G. Brink,
Ivan Altunin,
Raphael Baer-Way,
Andrew Bigley,
Kyle Blanchard,
Peter K. Blanchard,
James Bradley,
Samantha K. Cargill,
Chadwick Casper,
Teagan Chapman,
Vidhi Chander,
Sanyum Channa,
Byung Yun Choi,
Nick Choksi,
Matthew Chu,
Kelsey I. Clubb,
Daniel P. Cohen,
Paul A. Dalba,
Asia deGraw
, et al. (63 additional authors not shown)
Abstract:
We present BVRI and unfiltered Clear light curves of 70 stripped-envelope supernovae (SESNe), observed between 2003 and 2020, from the Lick Observatory Supernova Search (LOSS) follow-up program. Our SESN sample consists of 19 spectroscopically normal SNe Ib, two peculiar SNe Ib, six SNe Ibn, 14 normal SNe Ic, one peculiar SN Ic, ten SNe Ic-BL, 15 SNe IIb, one ambiguous SN IIb/Ib/c, and two superluminous SNe. Our follow-up photometry has (on a per-SN basis) a mean coverage of 81 photometric points (median of 58 points) and a mean cadence of 3.6 d (median of 1.2 d). From our full sample, a subset of 38 SNe have pre-maximum coverage in at least one passband, allowing for the peak brightness of each SN in this subset to be quantitatively determined. We describe our data collection and processing techniques, with emphasis toward our automated photometry pipeline, from which we derive publicly available data products to enable and encourage further study by the community. Using these data products, we derive host-galaxy extinction values through the empirical colour evolution relationship and, for the first time, produce accurate rise-time measurements for a large sample of SESNe in both optical and infrared passbands. By modeling multiband light curves, we find that SNe Ic tend to have lower ejecta masses and lower ejecta velocities than SNe Ib and IIb, but higher $^{56}$Ni masses.
Submitted 10 March, 2022;
originally announced March 2022.
-
Object-Aware Cropping for Self-Supervised Learning
Authors:
Shlok Mishra,
Anshul Shah,
Ankan Bansal,
Abhyuday Jagannatha,
Janit Anjaria,
Abhishek Sharma,
David Jacobs,
Dilip Krishnan
Abstract:
A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which the learned representation will capture. This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image. However, in other datasets such as OpenImages or COCO, which are more representative of real world uncurated data, there are typically multiple small objects in an image. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets. We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the model to learn both object and scene level semantic representations. Using this approach, which we call object-aware cropping, results in significant improvements over scene cropping on classification and object detection benchmarks. For example, on OpenImages, our approach achieves an improvement of 8.8% mAP over random scene-level cropping using MoCo-v2 based pre-training. We also show significant improvements on COCO and PASCAL-VOC object detection and segmentation tasks over the state-of-the-art self-supervised learning approaches. Our approach is efficient, simple and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.
Submitted 6 April, 2023; v1 submitted 1 December, 2021;
originally announced December 2021.
-
Pyramid Adversarial Training Improves ViT Performance
Authors:
Charles Herrmann,
Kyle Sargent,
Lu Jiang,
Ramin Zabih,
Huiwen Chang,
Ce Liu,
Dilip Krishnan,
Deqing Sun
Abstract:
Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformer (ViT). One such data augmentation technique is adversarial training (AT); however, many prior works have shown that this often results in poor clean accuracy. In this work, we present pyramid adversarial training (PyramidAT), a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (not directly applicable to ViT), our pyramid adversarial training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to 1.82% absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on 7 ImageNet robustness metrics, by absolute numbers ranging from 1.76% to 15.68%. We set a new state-of-the-art for ImageNet-C (41.42 mCE), ImageNet-R (53.92%), and ImageNet-Sketch (41.04%) without extra data, using only the ViT-B/16 backbone and our pyramid adversarial training. Our code is publicly available at pyramidat.github.io.
Submitted 2 September, 2022; v1 submitted 29 November, 2021;
originally announced November 2021.
-
CSI: Contrastive Data Stratification for Interaction Prediction and its Application to Compound-Protein Interaction Prediction
Authors:
Apurva Kalia,
Dilip Krishnan,
Soha Hassoun
Abstract:
Accurately predicting the likelihood of interaction between two objects (compound-protein sequence, user-item, author-paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multi-views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects.
Submitted 21 December, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Unsupervised Disentanglement without Autoencoding: Pitfalls and Future Directions
Authors:
Andrea Burns,
Aaron Sarna,
Dilip Krishnan,
Aaron Maschinot
Abstract:
Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs). While prior work has focused on generative methods for disentangled representation learning, these approaches do not scale to large datasets due to current limitations of generative models. Instead, we explore regularization methods with contrastive learning, which could result in disentangled representations that are powerful enough for large scale datasets and downstream applications. However, we find that unsupervised disentanglement is difficult to achieve due to optimization and initialization sensitivity, with trade-offs in task performance. We evaluate disentanglement with downstream tasks, analyze the benefits and disadvantages of each regularization used, and discuss future directions.
Submitted 14 August, 2021;
originally announced August 2021.
-
RingBFT: Resilient Consensus over Sharded Ring Topology
Authors:
Sajjad Rahnama,
Suyash Gupta,
Rohan Sogani,
Dhruv Krishnan,
Mohammad Sadoghi
Abstract:
The recent surge in federated data management applications has brought forth concerns about the security of underlying data and the consistency of replicas in the presence of malicious attacks. A prominent solution in this direction is to employ a permissioned blockchain framework that is modeled around traditional Byzantine Fault-Tolerant (BFT) consensus protocols. Any federated application expects its data to be globally scattered to achieve faster access. But, prior works have shown that traditional BFT protocols are slow. This has led to the rise of sharded-replicated blockchains. Existing BFT protocols for these sharded blockchains are efficient if client transactions require access to a single-shard, but face performance degradation if there is a cross-shard transaction that requires access to multiple shards. As cross-shard transactions are common, to resolve this dilemma, we present RingBFT, a novel meta-BFT protocol for sharded blockchains. RingBFT requires shards to adhere to the ring order, and follow the principle of process, forward, and re-transmit while ensuring the communication between shards is linear. Our evaluation of RingBFT against state-of-the-art sharding BFT protocols illustrates that RingBFT achieves up to 18x higher throughput, gracefully scales to nearly 500 globally distributed nodes, and achieves a peak throughput of 1.2 million transactions per second.
Submitted 23 March, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers
Authors:
Piotr Teterwak,
Chiyuan Zhang,
Dilip Krishnan,
Michael C. Mozer
Abstract:
A discriminatively trained neural net classifier can fit the training data perfectly if all information about its input other than class membership has been discarded prior to the output layer. Surprisingly, past research has discovered that some extraneous visual detail remains in the logit vector. This finding is based on inversion techniques that map deep embeddings back to images. We explore this phenomenon further using a novel synthesis of methods, yielding a feedforward inversion model that produces remarkably high fidelity reconstructions, qualitatively superior to those of past efforts. When applied to an adversarially robust classifier model, the reconstructions contain sufficient local detail and global structure that they might be confused with the original image in a quick glance, and the object category can clearly be gleaned from the reconstruction. Our approach is based on BigGAN (Brock, 2019), with conditioning on logits instead of one-hot class labels. We use our reconstruction model as a tool for exploring the nature of representations, including: the influence of model architecture and training objectives (specifically robust losses), the forms of invariance that networks achieve, representational differences between correctly and incorrectly classified images, and the effects of manipulating logits and images. We believe that our method can inspire future investigations into the nature of information flow in a neural net and can provide diagnostics for improving discriminative models.
Submitted 21 July, 2021; v1 submitted 15 March, 2021;
originally announced March 2021.
-
What Makes for Good Views for Contrastive Learning?
Authors:
Yonglong Tian,
Chen Sun,
Ben Poole,
Dilip Krishnan,
Cordelia Schmid,
Phillip Isola
Abstract:
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use theoretical and empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a by-product, we achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification ($73\%$ top-1 linear readout with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training. Code: http://github.com/HobbitLong/PyContrast
Submitted 18 December, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.
-
Supervised Contrastive Learning
Authors:
Prannay Khosla,
Piotr Teterwak,
Chen Wang,
Aaron Sarna,
Yonglong Tian,
Phillip Isola,
Aaron Maschinot,
Ce Liu,
Dilip Krishnan
Abstract:
Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.
Submitted 10 March, 2021; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
Authors:
Yonglong Tian,
Yue Wang,
Dilip Krishnan,
Joshua B. Tenenbaum,
Phillip Isola
Abstract:
The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost. Few-shot learning is widely used as one of the standard benchmarks in meta-learning. In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. An additional boost can be achieved through the use of self-distillation. This demonstrates that using a good learned embedding model can be more effective than sophisticated meta-learning algorithms. We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms. Code is available at: http://github.com/WangYueFt/rfs/.
Submitted 17 June, 2020; v1 submitted 25 March, 2020;
originally announced March 2020.
-
Fantastic Generalization Measures and Where to Find Them
Authors:
Yiding Jiang,
Behnam Neyshabur,
Hossein Mobahi,
Dilip Krishnan,
Samy Bengio
Abstract:
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusion drawn from those experiments would remain valid in other settings. We present the first large scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
Submitted 4 December, 2019;
originally announced December 2019.
-
Contrastive Representation Distillation
Authors:
Yonglong Tian,
Dilip Krishnan,
Phillip Isola
Abstract:
Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: http://github.com/HobbitLong/RepDistiller.
Submitted 24 January, 2022; v1 submitted 23 October, 2019;
originally announced October 2019.
-
Lick Observatory Supernova Search Follow-Up Program: Photometry Data Release of 93 Type Ia Supernovae
Authors:
Benjamin E. Stahl,
WeiKang Zheng,
Thomas de Jaeger,
Alexei V. Filippenko,
Andrew Bigley,
Kyle Blanchard,
Peter K. Blanchard,
Thomas G. Brink,
Samantha K. Cargill,
Chadwick Casper,
Sanyum Channa,
Byung Yun Choi,
Nick Choksi,
Jason Chu,
Kelsey I. Clubb,
Daniel P. Cohen,
Michael Ellison,
Edward Falcon,
Pegah Fazeli,
Kiera Fuller,
Mohan Ganeshalingam,
Elinor L. Gates,
Carolina Gould,
Goni Halevi,
Kevin T. Hayakawa
, et al. (30 additional authors not shown)
Abstract:
We present BVRI and unfiltered light curves of 93 Type Ia supernovae (SNe Ia) from the Lick Observatory Supernova Search (LOSS) follow-up program conducted between 2005 and 2018. Our sample consists of 78 spectroscopically normal SNe Ia, with the remainder divided between distinct subclasses (three SN 1991bg-like, three SN 1991T-like, four SNe Iax, two peculiar, and three super-Chandrasekhar events), and has a median redshift of 0.0192. The SNe in our sample have a median coverage of 16 photometric epochs at a cadence of 5.4 days, and the median first observed epoch is ~4.6 days before maximum B-band light. We describe how the SNe in our sample are discovered, observed, and processed, and we compare the results from our newly developed automated photometry pipeline to those from the previous processing pipeline used by LOSS. After investigating potential biases, we derive a final systematic uncertainty of 0.03 mag in BVRI for our dataset. We perform an analysis of our light curves with particular focus on using template fitting to measure the parameters that are useful in standardising SNe Ia as distance indicators. All of the data are available to the community, and we encourage future studies to incorporate our light curves in their analyses.
Submitted 24 September, 2019;
originally announced September 2019.
-
Full control of Co valence in isopolar LaCoO3 / LaTiO3 perovskite heterostructures via interfacial engineering
Authors:
Georgios Araizi-Kanoutas,
Jaap Geessinck,
Nicolas Gauquelin,
Steef Smit,
Xanthe Verbeek,
Shrawan K. Mishra,
Peter Bencok,
Christoph Schlueter,
Tien-Lin Lee,
Dileep Krishnan,
Jo Verbeeck,
Guus Rijnders,
Gertjan Koster,
Mark S. Golden
Abstract:
We report charge-transfer up to a single electron per interfacial unit cell across non-polar heterointerfaces from the Mott insulator LaTiO3 to the charge transfer insulator LaCoO3. In high-quality bi- and tri-layer systems grown using pulsed laser deposition, soft X-ray absorption, dichroism and STEM-EELS are used to probe the cobalt 3d-electron count and provide an element-specific investigation of the magnetic properties. The experiments prove a deterministically-tunable charge transfer process acting in the LaCoO3 within three unit cells of the heterointerface, able to generate full conversion to 3d7 divalent Co, which displays a paramagnetic ground state. The number of LaTiO3 / LaCoO3 interfaces, the thickness of an additional "break" layer between the LaTiO3 and LaCoO3, and the LaCoO3 film thickness itself in tri-layers provide a trio of sensitive control knobs for the charge transfer process, illustrating the efficacy of O2p-band alignment as a guiding principle for property design in complex oxide heterointerfaces.
Submitted 11 September, 2019;
originally announced September 2019.
-
Boundless: Generative Adversarial Networks for Image Extension
Authors:
Piotr Teterwak,
Aaron Sarna,
Dilip Krishnan,
Aaron Maschinot,
David Belanger,
Ce Liu,
William T. Freeman
Abstract:
Image extension models have broad applications in image editing, computational photography and computer graphics. While image inpainting has been extensively studied in the literature, it is challenging to directly apply the state-of-the-art inpainting methods to image extension as they tend to generate blurry or repetitive pixels with inconsistent semantics. We introduce semantic conditioning to the discriminator of a generative adversarial network (GAN), and achieve strong results on image extension with coherent semantics and visually pleasing colors and textures. We also show promising results in extreme extensions, such as panorama generation.
Submitted 19 August, 2019;
originally announced August 2019.
-
Adversarial Robustness through Local Linearization
Authors:
Chongli Qin,
James Martens,
Sven Gowal,
Dilip Krishnan,
Krishnamurthy Dvijotham,
Alhussein Fawzi,
Soham De,
Robert Stanforth,
Pushmeet Kohli
Abstract:
Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet, that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. Additionally, we match state of the art results for CIFAR-10 at 8/255.
Submitted 10 October, 2019; v1 submitted 4 July, 2019;
originally announced July 2019.
-
Contrastive Multiview Coding
Authors:
Yonglong Tian,
Dilip Krishnan,
Phillip Isola
Abstract:
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video unsupervised learning benchmarks. Code is released at: http://github.com/HobbitLong/CMC/.
Submitted 18 December, 2020; v1 submitted 13 June, 2019;
originally announced June 2019.
-
A Closed-Form Learned Pooling for Deep Classification Networks
Authors:
Vighnesh Birodkar,
Hossein Mobahi,
Dilip Krishnan,
Samy Bengio
Abstract:
In modern computer vision, convolutional neural networks (CNNs) are indispensable for image classification tasks due to their efficiency and effectiveness. Part of their superiority compared to other architectures comes from the fact that a single, local filter is shared across the entire image. However, there are scenarios where we may need to treat spatial locations in a non-uniform manner. We see this in nature when considering how humans have evolved foveation to process different areas in their field of vision with varying levels of detail. In this paper we propose a way to enable CNNs to learn different pooling weights for each pixel location. We do so by introducing an extended definition of a pooling operator. This operator can learn a strict super-set of what can be learned by average pooling or convolutions. It has the benefit of being shared across feature maps and can be encouraged to be local or diffuse depending on the data. We show that for fixed network weights, our pooling operator can be computed in closed-form by spectral decomposition of matrices associated with class separability. Through experiments, we show that this operator benefits generalization for ResNets and CNNs on the CIFAR-10, CIFAR-100 and SVHN datasets and improves robustness to geometric corruptions and perturbations on the CIFAR-10-C and CIFAR-10-P test sets.
Submitted 10 June, 2019;
originally announced June 2019.
-
Influence of stoichiometry on interfacial conductance in LaAlO$_3$/SrTiO$_3$ grown by 90$^o$ off-axis sputtering
Authors:
Chunhai Yin,
Dileep Krishnan,
Nicolas Gauquelin,
Jo Verbeeck,
Jan Aarts
Abstract:
We report on the fabrication of conducting interfaces between LaAlO$_3$ and SrTiO$_3$ by 90$^o$ off-axis sputtering in an Ar atmosphere. At a growth pressure of 0.04 mbar the interface is metallic, with a carrier density of the order of $10^{13}$ cm$^{-2}$ at 3 K. By increasing the growth pressure, we observe an increase of the out-of-plane lattice constants of the LaAlO$_3$ films while the in-plane lattice constants do not change. Also, the low-temperature sheet resistance increases with increasing growth pressure, leading to an insulating interface when the growth pressure reaches 0.10 mbar. We attribute the structural variations to an increase of the La/Al ratio, which also explains the transition from metallic behavior to insulating behavior of the interfaces. Our research emphasizes the key role of the cation stoichiometry of LaAlO$_3$ in the formation of the conducting interface, and also the control which is furnished by the Ar pressure in the growth process.
Submitted 2 November, 2018;
originally announced November 2018.
-
Predicting the Generalization Gap in Deep Networks with Margin Distributions
Authors:
Yiding Jiang,
Dilip Krishnan,
Hossein Mobahi,
Samy Bengio
Abstract:
As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how the generalization gap should be predicted from the training data and network parameters. In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of the margin distribution, that is, the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and the CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of margin distribution rather than just the margin (closest distance to decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.
Submitted 12 June, 2019; v1 submitted 28 September, 2018;
originally announced October 2018.
-
Large Margin Deep Networks for Classification
Authors:
Gamaleldin F. Elsayed,
Dilip Krishnan,
Hossein Mobahi,
Kevin Regan,
Samy Bengio
Abstract:
We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks.
In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm).
Submitted 3 December, 2018; v1 submitted 15 March, 2018;
originally announced March 2018.
-
Smart, Sparse Contours to Represent and Edit Images
Authors:
Tali Dekel,
Chuang Gan,
Dilip Krishnan,
Ce Liu,
William T. Freeman
Abstract:
We study the problem of reconstructing an image from information stored at contour locations. We show that high-quality reconstructions with high fidelity to the source image can be obtained from sparse input, e.g., comprising less than $6\%$ of image pixels. This is a significant improvement over existing contour-based reconstruction methods that require much denser input to capture subtle texture information and to ensure image quality. Our model, based on generative adversarial networks, synthesizes texture and details in regions where no input information is provided. The semantic knowledge encoded into our model and the sparsity of the input allow contours to be used as an intuitive interface for semantically-aware image manipulation: local edits in contour domain translate to long-range and coherent changes in pixel space. We can perform complex structural changes such as changing facial expression by simple edits of contours. Our experiments demonstrate that humans as well as a face recognition system mostly cannot distinguish between our reconstructions and the source images.
Submitted 9 April, 2018; v1 submitted 21 December, 2017;
originally announced December 2017.
-
Synthesizing Normalized Faces from Facial Identity Features
Authors:
Forrester Cole,
David Belanger,
Dilip Krishnan,
Aaron Sarna,
Inbar Mosseri,
William T. Freeman
Abstract:
We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.
Submitted 17 October, 2017; v1 submitted 17 January, 2017;
originally announced January 2017.
-
Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks
Authors:
Konstantinos Bousmalis,
Nathan Silberman,
David Dohan,
Dumitru Erhan,
Dilip Krishnan
Abstract:
Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.
Submitted 23 August, 2017; v1 submitted 16 December, 2016;
originally announced December 2016.
-
The Marriage of Incremental and Approximate Computing
Authors:
Dhanya R Krishnan
Abstract:
Most data analytics systems that require low-latency execution and efficient utilization of computing resources increasingly adopt two computational paradigms, namely, incremental and approximate computing. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output.
Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs for sub-computations that are unaffected by the changed input. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items.
In this thesis, we make the observation that these two computing paradigms are complementary, and can be married together! The high-level idea is to design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To concretize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation of the system shows that IncApprox achieves the benefits of both incremental and approximate computing.
Submitted 25 November, 2016;
originally announced November 2016.
-
Domain Separation Networks
Authors:
Konstantinos Bousmalis,
George Trigeorgis,
Nathan Silberman,
Dilip Krishnan,
Dumitru Erhan
Abstract:
The cost of large scale data collection and annotation often makes the application of machine learning algorithms to new tasks or datasets prohibitively expensive. One approach circumventing this cost is training models on synthetic data where annotations are provided automatically. Despite their appeal, such models often fail to generalize from synthetic to real images, necessitating domain adaptation algorithms to manipulate these models before they can be successfully applied. Existing approaches focus either on mapping representations from one domain to the other, or on learning to extract features that are invariant to the domain from which they were extracted. However, by focusing only on creating a mapping or shared representation between the two domains, they ignore the individual characteristics of each domain. We suggest that explicitly modeling what is unique to each domain can improve a model's ability to extract domain-invariant features. Inspired by work on private-shared component analysis, we explicitly learn to extract image representations that are partitioned into two subspaces: one component which is private to each domain and one which is shared across domains. Our model is trained not only to perform the task we care about in the source domain, but also to use the partitioned representation to reconstruct the images from both domains. Our novel architecture results in a model that outperforms the state-of-the-art on a range of unsupervised domain adaptation scenarios and additionally produces visualizations of the private and shared representations enabling interpretation of the domain adaptation process.
Submitted 21 August, 2016;
originally announced August 2016.
-
Learning visual groups from co-occurrences in space and time
Authors:
Phillip Isola,
Daniel Zoran,
Dilip Krishnan,
Edward H. Adelson
Abstract:
We propose a self-supervised framework that learns to group visual entities based on their rate of co-occurrence in space and time. To model statistical dependencies between the entities, we set up a simple binary classification problem in which the goal is to predict if two visual primitives occur in the same spatial or temporal context. We apply this framework to three domains: learning patch affinities from spatial adjacency in images, learning frame affinities from temporal adjacency in videos, and learning photo affinities from geospatial proximity in image collections. We demonstrate that in each case the learned affinities uncover meaningful semantic groupings. From patch affinities we generate object proposals that are competitive with state-of-the-art supervised methods. From frame affinities we generate movie scene segmentations that correlate well with DVD chapter structure. Finally, from geospatial affinities we learn groups that relate well to semantic place categories.
Submitted 20 November, 2015;
originally announced November 2015.
-
Blind Deconvolution with Non-local Sparsity Reweighting
Authors:
Dilip Krishnan,
Joan Bruna,
Rob Fergus
Abstract:
Blind deconvolution has made significant progress in the past decade. Most successful algorithms are classified either as Variational or Maximum a-Posteriori ($MAP$). In spite of the superior theoretical justification of variational techniques, carefully constructed $MAP$ algorithms have proven equally effective in practice. In this paper, we show that all successful $MAP$ and variational algorithms share a common framework, relying on the following key principles: sparsity promotion in the gradient domain, $l_2$ regularization for kernel estimation, and the use of convex (often quadratic) cost functions. Our observations lead to a unified understanding of the principles required for successful blind deconvolution. We incorporate these principles into a novel algorithm that improves significantly upon the state of the art.
Submitted 16 June, 2014; v1 submitted 16 November, 2013;
originally announced November 2013.
-
Ethics Understanding of Software Professional In Risk Reducing Reusability Coding Using Inclusion Set Theory
Authors:
G. Singaravel,
Dr. V. Palanisamy,
Dr. A. Krishnan
Abstract:
Technical skill and ability differ from person to person in software development projects. It is therefore necessary to identify each individual's talent and attitude so that contributions can be distributed uniformly across the different phases of the software development cycle. Lines-of-code analysis metrics are used to understand the various skills of the programmers in code development. Using inclusion set theory, $n(A \cup B)$ refers to the strong, risk-free code developed by the union of software professionals; the system must achieve the system goal, use memory effectively, and deliver the product on time.
Submitted 5 December, 2009;
originally announced December 2009.