-
Visual symbolic mechanisms: Emergent symbol processing in vision language models
Authors:
Rim Assouel,
Declan Campbell,
Taylor Webb
Abstract:
To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices,…
▽ More
To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Object-centric Binding in Contrastive Language-Image Pretraining
Authors:
Rim Assouel,
Pietro Astolfi,
Florian Bordes,
Michal Drozdzal,
Adriana Romero-Soriano
Abstract:
Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that dive…
▽ More
Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
The BrowserGym Ecosystem for Web Agent Research
Authors:
Thibault Le Sellier De Chezelles,
Maxime Gasse,
Alexandre Drouin,
Massimo Caccia,
Léo Boisvert,
Megh Thakkar,
Tom Marty,
Rim Assouel,
Sahar Omidi Shayegan,
Lawrence Keunho Jang,
Xing Han Lù,
Ori Yoran,
Dehan Kong,
Frank F. Xu,
Siva Reddy,
Quentin Cappart,
Graham Neubig,
Ruslan Salakhutdinov,
Nicolas Chapados,
Alexandre Lacoste
Abstract:
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) i…
▽ More
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
△ Less
Submitted 28 February, 2025; v1 submitted 6 December, 2024;
originally announced December 2024.
-
Action abstractions for amortized sampling
Authors:
Oussama Boussif,
Léna Néhale Ezzine,
Joseph D Viviano,
Michał Koziarski,
Moksh Jain,
Nikolay Malkin,
Emmanuel Bengio,
Rim Assouel,
Yoshua Bengio
Abstract:
As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample…
▽ More
As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.
△ Less
Submitted 19 October, 2024;
originally announced October 2024.
-
An Introduction to Vision-Language Modeling
Authors:
Florian Bordes,
Richard Yuanzhe Pang,
Anurag Ajay,
Alexander C. Li,
Adrien Bardes,
Suzanne Petryk,
Oscar Mañas,
Zhiqiu Lin,
Anas Mahmoud,
Bargav Jayaraman,
Mark Ibrahim,
Melissa Hall,
Yunyang Xiong,
Jonathan Lebensold,
Candace Ross,
Srihari Jayakumar,
Chuan Guo,
Diane Bouchacourt,
Haider Al-Tahan,
Karthik Padthe,
Vasu Sharma,
Hu Xu,
Xiaoqing Ellen Tan,
Megan Richards,
Samuel Lavoie
, et al. (16 additional authors not shown)
Abstract:
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…
▽ More
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning
Authors:
Rim Assouel,
Pau Rodriguez,
Perouz Taslakian,
David Vazquez,
Yoshua Bengio
Abstract:
A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-…
▽ More
A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes visual generative reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that our modular architectural choices can be used to generate new training tasks that lead to better out-of-distribution generalization. We compare our model to existing and new baselines in proposed visual reasoning benchmark that consists of applying arithmetic operations to MNIST digits.
△ Less
Submitted 28 October, 2023;
originally announced October 2023.
-
Image-to-image Mapping with Many Domains by Sparse Attribute Transfer
Authors:
Matthew Amodio,
Rim Assouel,
Victor Schmidt,
Tristan Sylvain,
Smita Krishnaswamy,
Yoshua Bengio
Abstract:
Unsupervised image-to-image translation consists of learning a pair of mappings between two domains without known pairwise correspondences between points. The current convention is to approach this task with cycle-consistent GANs: using a discriminator to encourage the generator to change the image to match the target domain, while training the generator to be inverted with another mapping. While…
▽ More
Unsupervised image-to-image translation consists of learning a pair of mappings between two domains without known pairwise correspondences between points. The current convention is to approach this task with cycle-consistent GANs: using a discriminator to encourage the generator to change the image to match the target domain, while training the generator to be inverted with another mapping. While ending up with paired inverse functions may be a good end result, enforcing this restriction at all times during training can be a hindrance to effective modeling. We propose an alternate approach that directly restricts the generator to performing a simple sparse transformation in a latent layer, motivated by recent work from cognitive neuroscience suggesting an architectural prior on representations corresponding to consciousness. Our biologically motivated approach leads to representations more amenable to transformation by disentangling high-level abstract concepts in the latent space. We demonstrate that image-to-image domain translation with many different domains can be learned more effectively with our architecturally constrained, simple transformation than with previous unconstrained architectures that rely on a cycle-consistency loss.
△ Less
Submitted 23 June, 2020;
originally announced June 2020.
-
DEFactor: Differentiable Edge Factorization-based Probabilistic Graph Generation
Authors:
Rim Assouel,
Mohamed Ahmed,
Marwin H Segler,
Amir Saffari,
Yoshua Bengio
Abstract:
Generating novel molecules with optimal properties is a crucial step in many industries such as drug discovery. Recently, deep generative models have shown a promising way of performing de-novo molecular design. Although graph generative models are currently available they either have a graph size dependency in their number of parameters, limiting their use to only very small graphs or are formula…
▽ More
Generating novel molecules with optimal properties is a crucial step in many industries such as drug discovery. Recently, deep generative models have shown a promising way of performing de-novo molecular design. Although graph generative models are currently available they either have a graph size dependency in their number of parameters, limiting their use to only very small graphs or are formulated as a sequence of discrete actions needed to construct a graph, making the output graph non-differentiable w.r.t the model parameters, therefore preventing them to be used in scenarios such as conditional graph generation. In this work we propose a model for conditional graph generation that is computationally efficient and enables direct optimisation of the graph. We demonstrate favourable performance of our model on prototype-based molecular graph conditional generation tasks.
△ Less
Submitted 24 November, 2018;
originally announced November 2018.