Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Authors: Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang

Abstract: We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of the corresponding classes. Such alignment is beneficial for few-shot classification due to the reinforced contributions from both types of prototypes. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning, as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP

Submitted 14 July, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2302.03793

Self-Supervised Unseen Object Instance Segmentation via Long-Term Robot Interaction

Authors: Yangxiao Lu, Ninad Khargonkar, Zesheng Xu, Charles Averill, Kamalesh Palanisamy, Kaiyu Hang, Yunhui Guo, Nicholas Ruozzi, Yu Xiang

Abstract: We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, and segmentation errors usually occur on these images for existing object segmentation networks. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.

Submitted 7 February, 2023; originally announced February 2023.

Comments: 11 pages, 7 figures, 5 tables

arXiv:2011.04232

doi: 10.1145/3446382.3448362

SplitEasy: A Practical Approach for Training ML models on Mobile Devices

Authors: Kamalesh Palanisamy, Vivek Khimani, Moin Hussain Moti, Dimitris Chatzopoulos

Abstract: Modern mobile devices, although resourceful, cannot train state-of-the-art machine learning models without the assistance of servers, which require access to, potentially, privacy-sensitive user data. Split learning has recently emerged as a promising technique for training complex deep learning (DL) models on low-powered mobile devices. The core idea behind this technique is to train the sensitive layers of a DL model on mobile devices while offloading the computationally intensive layers to a server. Although a lot of works have already explored the effectiveness of split learning in simulated settings, a usable toolkit for this purpose does not exist. In this work, we highlight the theoretical and technical challenges that need to be resolved to develop a functional framework that trains ML models in mobile devices without transferring raw data to a server. Focusing on these challenges, we propose SplitEasy, a framework for training ML models on mobile devices using split learning. Using the abstraction provided by SplitEasy, developers can run various DL models under split learning setting by making minimal modifications. We provide a detailed explanation of SplitEasy and perform experiments with six state-of-the-art neural networks. We demonstrate how SplitEasy can train models that cannot be trained solely by a mobile device while incurring nearly constant time per data sample.

Submitted 29 January, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: 7 pages, 4 figures, Accepted at the ACM HotMobile workshop

arXiv:2007.11154

Rethinking CNN Models for Audio Classification

Authors: Kamalesh Palanisamy, Dipika Singhania, Angela Yao

Abstract: In this paper, we show that ImageNet-Pretrained standard deep CNN models can be used as strong baseline networks for audio classification. Even though there is a significant difference between audio Spectrogram and standard ImageNet image samples, transfer learning assumptions still hold firmly. To understand what enables the ImageNet pretrained models to learn useful audio representations, we systematically study how much of pretrained weights is useful for learning spectrograms. We show (1) that for a given standard model using pretrained weights is better than using randomly initialized weights (2) qualitative results of what the CNNs learn from the spectrograms by visualizing the gradients. Besides, we show that even though we use the pretrained model weights for initialization, there is variance in performance in various output runs of the same model. This variance in performance is due to the random initialization of linear classification layer and random mini-batch orderings in multiple runs. This brings significant diversity to build stronger ensemble models with an overall improvement in accuracy. An ensemble of ImageNet pretrained DenseNet achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42% validation accuracy on the UrbanSound8K dataset which is the current state-of-the-art on both of these datasets.

Submitted 13 November, 2020; v1 submitted 21 July, 2020; originally announced July 2020.

Comments: 8 pages, 3 figures

Showing 1–4 of 4 results for author: Palanisamy, K