-
CoLLM: A Large Language Model for Composed Image Retrieval
Authors:
Chuong Huynh,
Jinyu Yang,
Ashish Tawari,
Mubarak Shah,
Son Tran,
Raffay Hamid,
Trishul Chilimbi,
Abhinav Shrivastava
Abstract:
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
Submitted 25 March, 2025;
originally announced March 2025.
-
Open Vocabulary Multi-Label Video Classification
Authors:
Rohit Gupta,
Mamshad Nayeem Rizve,
Jayakrishnan Unnikrishnan,
Ashish Tawari,
Son Tran,
Mubarak Shah,
Benjamin Yao,
Trishul Chilimbi
Abstract:
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities (e.g., objects) in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance, with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts, and we propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
Submitted 12 July, 2024;
originally announced July 2024.
-
Dynamic Complexity of Expansion
Authors:
Samir Datta,
Anuj Tawari,
Yadu Vasudev
Abstract:
Dynamic Complexity was introduced by Immerman and Patnaik \cite{PatnaikImmerman97} (see also \cite{DongST95}). It has seen a resurgence of interest in the recent past, see \cite{DattaHK14,ZeumeS15,MunozVZ16,BouyerJ17,Zeume17,DKMSZ18,DMVZ18,BarceloRZ18,DMSVZ19,SchmidtSVZK20,DKMTVZ20} for some representative examples. Use of linear algebra has been a notable feature of some of these papers. We extend this theme to show that the gap version of spectral expansion in bounded degree graphs can be maintained in the class $\mathsf{DynAC}^0$ (also known as $\mathsf{DynFO}$, for domain independent queries) under batch changes (insertions and deletions) of $O(\frac{\log{n}}{\log{\log{n}}})$ many edges.
The spectral graph theoretic material of this work is based on the paper by Kale-Seshadri \cite{KaleS11}. Our primary technical contribution is to maintain up to logarithmic powers of the transition matrix of a bounded degree undirected graph in $\mathsf{DynAC}^0$.
Submitted 13 August, 2020;
originally announced August 2020.
-
Dynamic complexity of Reachability: How many changes can we handle?
Authors:
Samir Datta,
Pankaj Kumar,
Anish Mukherjee,
Anuj Tawari,
Nils Vortmeier,
Thomas Zeume
Abstract:
In 2015, it was shown that reachability for arbitrary directed graphs can be updated by first-order formulas after inserting or deleting single edges. Later, in 2018, this was extended for changes of size $\frac{\log n}{\log \log n}$, where $n$ is the size of the graph. Changes of polylogarithmic size can be handled when also majority quantifiers may be used.
In this paper we extend these results by showing that, for changes of polylogarithmic size, first-order update formulas suffice for maintaining (1) undirected reachability, and (2) directed reachability under insertions. For classes of directed graphs for which efficient parallel algorithms can compute non-zero circulation weights, reachability can be maintained with update formulas that may use "modulo 2" quantifiers under changes of polylogarithmic size. Examples for these classes include the class of planar graphs and graphs with bounded treewidth. The latter is shown here.
As the logics we consider cannot maintain reachability under changes of larger sizes, our results are optimal with respect to the size of the changes.
Submitted 27 April, 2020;
originally announced April 2020.
-
Interaction Graphs for Object Importance Estimation in On-road Driving Videos
Authors:
Zehua Zhang,
Ashish Tawari,
Sujitha Martin,
David Crandall
Abstract:
A vehicle driving along the road is surrounded by many objects, but only a small subset of them influence the driver's decisions and actions. Learning to estimate the importance of each object on the driver's real-time decision-making may help better understand human driving behavior and lead to more reliable autonomous driving systems. Solving this problem requires models that understand the interactions between the ego-vehicle and the surrounding objects. However, interactions among other objects in the scene can potentially also be very helpful, e.g., a pedestrian beginning to cross the road between the ego-vehicle and the car in front will make the car in front less important. We propose a novel framework for object importance estimation using an interaction graph, in which the features of each object node are updated by interacting with others through graph convolution. Experiments show that our model outperforms state-of-the-art baselines with much less input and pre-processing.
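The abstract says that node features are updated "through graph convolution" but gives no formula. As a rough illustration only, the sketch below applies one generic mean-aggregation graph-convolution step, ReLU(D^-1 (A + I) H W), in plain Python. The function name `graph_conv`, the toy adjacency, the feature vectors, and the identity weight matrix are all assumptions made for this example; this is not the authors' architecture.

```python
from typing import List

Matrix = List[List[float]]


def graph_conv(features: Matrix, adjacency: Matrix, weight: Matrix) -> Matrix:
    """One generic graph-convolution step: ReLU(D^-1 (A + I) H W).

    features:  one row of d_in values per object node (H)
    adjacency: n x n matrix, nonzero where two objects interact (A)
    weight:    d_in x d_out projection matrix (W), hypothetical values
    """
    n = len(features)
    d_in = len(weight)
    d_out = len(weight[0])
    updated = []
    for i in range(n):
        # Add a self-loop so a node keeps part of its own features.
        row = [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        degree = sum(row)
        # Average the features of the node and its neighbours.
        agg = [
            sum(row[j] * features[j][k] for j in range(n)) / degree
            for k in range(d_in)
        ]
        # Linear projection followed by ReLU.
        updated.append(
            [max(0.0, sum(agg[k] * weight[k][m] for k in range(d_in)))
             for m in range(d_out)]
        )
    return updated


# Toy scene: node 0 interacts with nodes 1 and 2 (all values are made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
new_features = graph_conv(features, adjacency, identity)
```

With the identity weight matrix, each node's new feature vector is simply the average over itself and its neighbours, so the effect of the interaction edges is easy to inspect by hand.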
Submitted 12 March, 2020;
originally announced March 2020.
-
Grounding Human-to-Vehicle Advice for Self-driving Vehicles
Authors:
Jinkyu Kim,
Teruhisa Misu,
Yi-Ting Chen,
Ashish Tawari,
John Canny
Abstract:
Recent success suggests that deep neural control networks are likely to be a key component of self-driving vehicles. These networks are trained on large datasets to imitate human actions, but they lack semantic understanding of image contents. This makes them brittle and potentially unsafe in situations that do not match training data. Here, we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and the control (steering and speed). Attention mechanisms tie controller behavior to salient objects in the advice. We evaluate our model on a novel advisable driving dataset with manually annotated human-to-vehicle advice called Honda Research Institute-Advice Dataset (HAD). We show that taking advice improves the performance of the end-to-end network, while the network cues on a variety of visual features that are provided by advice. The dataset is available at https://usa.honda-ri.com/HAD.
Submitted 16 November, 2019;
originally announced November 2019.
-
Context Aware Road-user Importance Estimation (iCARE)
Authors:
Alireza Rahimpour,
Sujitha Martin,
Ashish Tawari,
Hairong Qi
Abstract:
Road-users are a critical part of decision-making for both self-driving cars and driver assistance systems. Some road-users, however, are more important for decision-making than others because of their respective intentions, the ego vehicle's intention, and their effects on each other. In this paper, we propose a novel architecture for road-user importance estimation which takes advantage of the local and global context of the scene. For local context, the model exploits the appearance of the road users (which captures orientation, intention, etc.) and their location relative to the ego-vehicle. The global context in our model is defined based on the feature map of the convolutional layer of the module which predicts the future path of the ego-vehicle and contains rich global information of the scene (e.g., infrastructure, road lanes, etc.), as well as the ego vehicle's intention information. Moreover, this paper introduces a new data set of real-world driving, concentrated around intersections, which includes annotations of important road users. Systematic evaluations of our proposed method against several baselines show promising results.
Submitted 30 August, 2019;
originally announced September 2019.
-
Goal-oriented Object Importance Estimation in On-road Driving Videos
Authors:
Mingfei Gao,
Ashish Tawari,
Sujitha Martin
Abstract:
We formulate a new problem as Object Importance Estimation (OIE) in on-road driving videos, where the road users are considered as important objects if they have influence on the control decision of the ego-vehicle's driver. The importance of a road user depends on both its visual dynamics, e.g., appearance, motion and location, in the driving scene and the driving goal, e.g., the planned path, of the ego vehicle. We propose a novel framework that incorporates both visual model and goal representation to conduct OIE. To evaluate our framework, we collect an on-road driving dataset at traffic intersections in the real world and conduct human-labeled annotation of the important objects. Experimental results show that our goal-oriented method outperforms baselines and has much more improvement on the left-turn and right-turn scenarios. Furthermore, we explore the possibility of using object importance for driving control prediction and demonstrate that binary brake prediction can be improved with the information of object importance.
Submitted 7 May, 2019;
originally announced May 2019.
-
Sums of read-once formulas: How many summands suffice?
Authors:
Meena Mahajan,
Anuj Tawari
Abstract:
An arithmetic read-once formula (ROF) is a formula (circuit of fan-out 1) over $+,\times$ where each variable labels at most one leaf. Every multilinear polynomial can be expressed as the sum of ROFs. In this work, we prove, for certain multilinear polynomials, a tight lower bound on the number of summands in such an expression.
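The read-once condition defined above can be made concrete with a small sketch. The classes `Var` and `Op` and the helpers `is_read_once` and `evaluate` are hypothetical names chosen for this illustration, and the formula tree is assumed to have only variables at its leaves (no constants). The check is purely syntactic: it tests whether a given formula tree is read-once, not whether the polynomial it computes has some read-once representation.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Var:
    """A leaf labelled by a variable name."""
    name: str


@dataclass(frozen=True)
class Op:
    """An internal gate; symbol is '+' or '*'."""
    symbol: str
    left: "Node"
    right: "Node"


Node = Union[Var, Op]


def leaf_variables(node: Node) -> List[str]:
    """List the variable labels of all leaves, with repetition."""
    if isinstance(node, Var):
        return [node.name]
    return leaf_variables(node.left) + leaf_variables(node.right)


def is_read_once(node: Node) -> bool:
    """True if every variable labels at most one leaf of the tree."""
    names = leaf_variables(node)
    return len(names) == len(set(names))


def evaluate(node: Node, assignment: Dict[str, int]) -> int:
    """Evaluate the formula at an integer assignment."""
    if isinstance(node, Var):
        return assignment[node.name]
    left = evaluate(node.left, assignment)
    right = evaluate(node.right, assignment)
    return left + right if node.symbol == "+" else left * right


x1, x2, x3 = Var("x1"), Var("x2"), Var("x3")
factored = Op("*", x1, Op("+", x2, x3))                 # x1 * (x2 + x3)
expanded = Op("+", Op("*", x1, x2), Op("*", x1, x3))    # x1*x2 + x1*x3
```

Here `factored` is a read-once formula, while `expanded` computes the same polynomial but reads `x1` twice. Viewed differently, `expanded` is a sum of two ROFs, `x1*x2` and `x1*x3`, which is the kind of expression whose number of summands the abstract is concerned with.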
Submitted 8 March, 2016;
originally announced March 2016.
-
Read-once polynomials: How many summands suffice?
Authors:
Meena Mahajan,
Anuj Tawari
Abstract:
An arithmetic read-once formula (ROF) is a formula (circuit of fan-out 1) over $+, \times$ where each variable labels at most one leaf. Every multilinear polynomial can be expressed as the sum of ROFs. In this work, we prove, for certain multilinear polynomials, a tight lower bound on the number of summands in such an expression.
Submitted 14 December, 2015;
originally announced December 2015.