-
Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
Authors:
Jishnu Jaykumar P,
Kamalesh Palanisamy,
Yu-Wei Chao,
Xinya Du,
Yu Xiang
Abstract:
We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP, which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of the corresponding classes. Such alignment is beneficial for few-shot classification due to the reinforced contributions from both types of prototypes. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning, as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP
Submitted 14 July, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
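The abstract above describes classification with both image and text prototypes. A minimal sketch of that idea follows, under stated assumptions: the toy 2-D vectors stand in for CLIP embeddings, and the cosine-similarity scoring, the mixing weight `alpha`, and the `temperature` value are illustrative choices that the abstract does not specify. The function names are hypothetical and are not taken from the Proto-CLIP codebase.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_prototype(embeddings):
    """Class prototype: the normalized mean of the class's support embeddings."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def classify(query, image_protos, text_protos, alpha=0.5, temperature=10.0):
    """Mix class probabilities from image and text prototypes.

    alpha and temperature are assumed hyperparameters for this sketch.
    """
    q = l2_normalize(query)
    p_img = softmax([temperature * dot(q, p) for p in image_protos])
    p_txt = softmax([temperature * dot(q, p) for p in text_protos])
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_img, p_txt)]

# Toy few-shot support set: two classes, two image embeddings each (made-up values).
support = {0: [[1.0, 0.1], [0.9, 0.0]], 1: [[0.0, 1.0], [0.1, 0.8]]}
image_protos = [
    mean_prototype([l2_normalize(e) for e in support[c]]) for c in sorted(support)
]
# Stand-ins for text-encoder embeddings of the two class names.
text_protos = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]

probs = classify([0.8, 0.2], image_protos, text_protos)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this toy setup the query lies close to the class 0 prototypes of both modalities, so both probability terms favor the same class. The sketch covers only a training-free style of prediction and does not model the adaptation or prototype alignment that the abstract mentions.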
-
Self-Supervised Unseen Object Instance Segmentation via Long-Term Robot Interaction
Authors:
Yangxiao Lu,
Ninad Khargonkar,
Zesheng Xu,
Charles Averill,
Kamalesh Palanisamy,
Kaiyu Hang,
Yunhui Guo,
Nicholas Ruozzi,
Yu Xiang
Abstract:
We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects until after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, on which existing object segmentation networks usually make segmentation errors. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.
Submitted 7 February, 2023;
originally announced February 2023.
-
SplitEasy: A Practical Approach for Training ML models on Mobile Devices
Authors:
Kamalesh Palanisamy,
Vivek Khimani,
Moin Hussain Moti,
Dimitris Chatzopoulos
Abstract:
Modern mobile devices, although resourceful, cannot train state-of-the-art machine learning models without the assistance of servers, which require access to potentially privacy-sensitive user data. Split learning has recently emerged as a promising technique for training complex deep learning (DL) models on low-powered mobile devices. The core idea behind this technique is to train the sensitive layers of a DL model on mobile devices while offloading the computationally intensive layers to a server. Although many works have already explored the effectiveness of split learning in simulated settings, a usable toolkit for this purpose does not exist. In this work, we highlight the theoretical and technical challenges that need to be resolved to develop a functional framework that trains ML models on mobile devices without transferring raw data to a server. Focusing on these challenges, we propose SplitEasy, a framework for training ML models on mobile devices using split learning. Using the abstraction provided by SplitEasy, developers can run various DL models in a split learning setting with minimal modifications. We provide a detailed explanation of SplitEasy and perform experiments with six state-of-the-art neural networks. We demonstrate how SplitEasy can train models that cannot be trained solely by a mobile device while incurring nearly constant time per data sample.
Submitted 29 January, 2021; v1 submitted 9 November, 2020;
originally announced November 2020.
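The split learning idea in the abstract above, where some layers run on the device and the rest run on a server, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Client` and `Server` classes are hypothetical and are not the SplitEasy API, each side holds a single scalar weight, and the target is sent to the server, which is only one possible arrangement. The point is that only the activation `h` and its gradient cross the boundary, while the raw input `x` stays on the client.

```python
class Client:
    """Holds the first layer; raw inputs never leave this object."""

    def __init__(self, w=0.5, lr=0.01):
        self.w, self.lr = w, lr
        self.x = None

    def forward(self, x):
        self.x = x          # cache the input for the backward pass
        return self.w * x   # only this activation is sent to the server

    def backward(self, grad_h):
        # Chain rule: dL/dw = dL/dh * dh/dw = grad_h * x
        self.w -= self.lr * grad_h * self.x


class Server:
    """Holds the remaining layer and computes the loss."""

    def __init__(self, w=0.5, lr=0.01):
        self.w, self.lr = w, lr

    def step(self, h, target):
        y = self.w * h
        loss = (y - target) ** 2
        grad_y = 2.0 * (y - target)
        grad_h = grad_y * self.w          # gradient returned to the client
        self.w -= self.lr * grad_y * h    # update the server-side weight
        return loss, grad_h


# Toy regression task: target = 2 * x (made-up data).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
client, server = Client(), Server()
losses = []
for epoch in range(200):
    total = 0.0
    for x, target in data:
        h = client.forward(x)                  # device-side forward pass
        loss, grad_h = server.step(h, target)  # server-side forward and backward
        client.backward(grad_h)                # device-side backward pass
        total += loss
    losses.append(total)
```

After training, the product `client.w * server.w` approaches 2, which shows that the two halves jointly fit the task even though neither side sees the full model. A real deployment would replace the direct method calls with network messages and would use actual neural network layers on each side.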
-
Rethinking CNN Models for Audio Classification
Authors:
Kamalesh Palanisamy,
Dipika Singhania,
Angela Yao
Abstract:
In this paper, we show that standard deep CNN models pretrained on ImageNet can be used as strong baseline networks for audio classification. Even though there is a significant difference between audio spectrograms and standard ImageNet image samples, transfer learning assumptions still hold firmly. To understand what enables the ImageNet-pretrained models to learn useful audio representations, we systematically study how much of the pretrained weights is useful for learning spectrograms. We show (1) that, for a given standard model, using pretrained weights is better than using randomly initialized weights, and (2) qualitative results of what the CNNs learn from the spectrograms, obtained by visualizing the gradients. In addition, we show that even though we use the pretrained model weights for initialization, performance varies across different runs of the same model. This variance is due to the random initialization of the linear classification layer and the random mini-batch orderings in multiple runs. The resulting diversity makes it possible to build stronger ensemble models with an overall improvement in accuracy. An ensemble of ImageNet-pretrained DenseNet models achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42% validation accuracy on the UrbanSound8K dataset, which is the current state of the art on both of these datasets.
Submitted 13 November, 2020; v1 submitted 21 July, 2020;
originally announced July 2020.
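The abstract above attributes the ensemble's gain to diversity across runs of the same model. One common way to combine such runs is to average their class probabilities, as sketched below. This is an assumption: the abstract does not say how the ensemble combines its members, the function name `ensemble_predict` is hypothetical, and the probabilities are made up.

```python
def ensemble_predict(prob_runs):
    """Average per-class probabilities over runs and return (label, averages)."""
    n_runs = len(prob_runs)
    n_classes = len(prob_runs[0])
    avg = [sum(run[c] for run in prob_runs) / n_runs for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


# Softmax outputs for one clip from three hypothetical runs of the same model.
# The first run alone would pick class 0; the other two pick class 1.
runs = [
    [0.50, 0.40, 0.10],
    [0.20, 0.70, 0.10],
    [0.45, 0.50, 0.05],
]
label, avg = ensemble_predict(runs)
```

Here the averaged probabilities favor class 1, so a single run's disagreement is outvoted by the other two. Averaging probabilities rather than taking a majority vote over hard labels also lets confident runs count for more than uncertain ones.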