-
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Authors:
Juan A. Rodriguez,
Haotian Zhang,
Abhay Puri,
Aarash Feizi,
Rishav Pramanik,
Pascal Wichmann,
Arnab Mondal,
Mohammad Reza Samsami,
Rabiul Awal,
Perouz Taslakian,
Spandana Gella,
Sai Rajeswar,
David Vazquez,
Christopher Pal,
Marco Pedersoli
Abstract:
Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
Submitted 27 May, 2025;
originally announced May 2025.
-
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Authors:
Patrice Bechard,
Chao Wang,
Amirhossein Abaskohi,
Juan Rodriguez,
Christopher Pal,
David Vazquez,
Spandana Gella,
Sai Rajeswar,
Perouz Taslakian
Abstract:
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
Submitted 27 March, 2025;
originally announced March 2025.
-
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Authors:
Shravan Nayak,
Xiangru Jian,
Kevin Qinghong Lin,
Juan A. Rodriguez,
Montek Kalsi,
Rabiul Awal,
Nicolas Chapados,
M. Tamer Özsu,
Aishwarya Agrawal,
David Vazquez,
Christopher Pal,
Perouz Taslakian,
Spandana Gella,
Sai Rajeswar
Abstract:
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks -- Element Grounding, Layout Grounding, and Action Prediction -- with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.
Submitted 6 May, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Learning to Defer for Causal Discovery with Imperfect Experts
Authors:
Oscar Clivio,
Divyat Mahajan,
Perouz Taslakian,
Sara Magliacane,
Ioannis Mitliagkas,
Valentina Zantedeschi,
Alexandre Drouin
Abstract:
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
Submitted 18 February, 2025;
originally announced February 2025.
-
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
Authors:
Shubham Gupta,
Zichao Li,
Tianyi Chen,
Cem Subakan,
Siva Reddy,
Perouz Taslakian,
Valentina Zantedeschi
Abstract:
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
Submitted 11 February, 2025;
originally announced February 2025.
-
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Authors:
Ahmed Masry,
Juan A. Rodriguez,
Tianyu Zhang,
Suyuchen Wang,
Chao Wang,
Aarash Feizi,
Akshay Kalkunte Suresh,
Abhay Puri,
Xiangru Jian,
Pierre-André Noël,
Sathwik Tejaswi Madhusudhan,
Marco Pedersoli,
Bang Liu,
Nicolas Chapados,
Yoshua Bengio,
Enamul Hoque,
Christopher Pal,
Issam H. Laradji,
David Vazquez,
Perouz Taslakian,
Spandana Gella,
Sai Rajeswar
Abstract:
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
Submitted 3 February, 2025;
originally announced February 2025.
-
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Authors:
Juan Rodriguez,
Xiangru Jian,
Siba Smarak Panigrahi,
Tianyu Zhang,
Aarash Feizi,
Abhay Puri,
Akshay Kalkunte,
François Savard,
Ahmed Masry,
Shravan Nayak,
Rabiul Awal,
Mahsa Massoud,
Amirhossein Abaskohi,
Zichao Li,
Suyuchen Wang,
Pierre-André Noël,
Mats Leon Richter,
Saverio Vadacchino,
Shubham Agarwal,
Sanket Biswas,
Sara Shanian,
Ying Zhang,
Noah Bolger,
Kurt MacDonald,
Simon Fauvel
, et al. (18 additional authors not shown)
Abstract:
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
Submitted 17 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Authors:
Gaurav Sahu,
Abhay Puri,
Juan Rodriguez,
Amirhossein Abaskohi,
Mohammad Chegini,
Alexandre Drouin,
Perouz Taslakian,
Valentina Zantedeschi,
Alexandre Lacoste,
David Vazquez,
Nicolas Chapados,
Christopher Pal,
Sai Rajeswar Mudumba,
Issam Hadj Laradji
Abstract:
Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We introduce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents' ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench.
Submitted 27 February, 2025; v1 submitted 8 July, 2024;
originally announced July 2024.
-
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Authors:
Joao Monteiro,
Pierre-Andre Noel,
Etienne Marcotte,
Sai Rajeswar,
Valentina Zantedeschi,
David Vazquez,
Nicolas Chapados,
Christopher Pal,
Perouz Taslakian
Abstract:
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
Submitted 5 November, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
Authors:
Tianyu Zhang,
Suyuchen Wang,
Lu Li,
Ge Zhang,
Perouz Taslakian,
Sai Rajeswar,
Jie Fu,
Bang Liu,
Yoshua Bengio
Abstract:
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
Submitted 18 April, 2025; v1 submitted 10 June, 2024;
originally announced June 2024.
-
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
Authors:
João Monteiro,
Étienne Marcotte,
Pierre-André Noël,
Valentina Zantedeschi,
David Vázquez,
Nicolas Chapados,
Christopher Pal,
Perouz Taslakian
Abstract:
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
Submitted 1 November, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
A Sparsity Principle for Partially Observable Causal Representation Learning
Authors:
Danru Xu,
Dingling Yao,
Sébastien Lachapelle,
Perouz Taslakian,
Julius von Kügelgen,
Francesco Locatello,
Sara Magliacane
Abstract:
Causal representation learning aims at identifying high-level causal variables from perceptual data. Most methods assume that all latent causal variables are captured in the high-dimensional observations. We instead consider a partially observed setting, in which each measurement only provides information about a subset of the underlying causal state. Prior work has studied this setting with multiple domains or views, each depending on a fixed subset of latents. Here, we focus on learning from unpaired observations from a dataset with an instance-dependent partial observability pattern. Our main contribution is to establish two identifiability results for this setting: one for linear mixing functions without parametric assumptions on the underlying causal model, and one for piecewise linear mixing functions with Gaussian latent causal variables. Based on these insights, we propose two methods for estimating the underlying causal variables by enforcing sparsity in the inferred representation. Experiments on different simulated datasets and established benchmarks highlight the effectiveness of our approach in recovering the ground-truth latents.
Submitted 15 June, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Capture the Flag: Uncovering Data Insights with Large Language Models
Authors:
Issam Laradji,
Perouz Taslakian,
Sai Rajeswar,
Valentina Zantedeschi,
Alexandre Lacoste,
Nicolas Chapados,
David Vazquez,
Christopher Pal,
Alexandre Drouin
Abstract:
The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community.
Submitted 21 December, 2023;
originally announced December 2023.
-
Multi-View Causal Representation Learning with Partial Observability
Authors:
Dingling Yao,
Danru Xu,
Sébastien Lachapelle,
Sara Magliacane,
Perouz Taslakian,
Georg Martius,
Julius von Kügelgen,
Francesco Locatello
Abstract:
We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.
Submitted 8 March, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
-
OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning
Authors:
Rim Assouel,
Pau Rodriguez,
Perouz Taslakian,
David Vazquez,
Yoshua Bengio
Abstract:
A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes visual generative reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that our modular architectural choices can be used to generate new training tasks that lead to better out-of-distribution generalization. We compare our model to existing and new baselines on a proposed visual reasoning benchmark that consists of applying arithmetic operations to MNIST digits.
Submitted 28 October, 2023;
originally announced October 2023.
-
Using Graph Algorithms to Pretrain Graph Completion Transformers
Authors:
Jonathan Pilault,
Michael Galkin,
Bahare Fatemi,
Perouz Taslakian,
David Vasquez,
Christopher Pal
Abstract:
Recent work on Graph Neural Networks has demonstrated that self-supervised pretraining can further enhance performance on downstream graph, link, and node classification tasks. However, the efficacy of pretraining tasks has not been fully investigated for downstream large knowledge graph completion tasks. Using a contextualized knowledge graph embedding approach, we investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination. We leverage the versatility of our Transformer-based model to explore graph structure generation pretraining tasks (i.e. path and k-hop neighborhood generation), typically inapplicable to most graph embedding methods. We further propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task across three downstream knowledge graph completion datasets. While using our new path-finding algorithm as a pretraining signal provides 2-3% MRR improvements, we show that pretraining on all signals together gives the best knowledge graph completion results. In a multitask setting that combines all pretraining tasks, our method surpasses the latest and strong-performing knowledge graph embedding methods on all metrics for FB15K-237, on MRR and Hit@1 for WN18RR, and on MRR and Hit@10 for JF17K (a knowledge hypergraph dataset).
Submitted 27 March, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Typing assumptions improve identification in causal discovery
Authors:
Philippe Brouillard,
Perouz Taslakian,
Alexandre Lacoste,
Sebastien Lachapelle,
Alexandre Drouin
Abstract:
Causal discovery from observational data is a challenging task that can only be solved up to a set of equivalent solutions, called an equivalence class. Such classes, which are often large in size, encode uncertainties about the orientation of some edges in the causal graph. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of variables, thus circumscribing the equivalence class. Namely, we introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph. We also propose causal discovery algorithms that make use of these assumptions and demonstrate their benefits on simulated and pseudo-real data.
Submitted 28 February, 2022; v1 submitted 22 July, 2021;
originally announced July 2021.
-
Knowledge Hypergraph Embedding Meets Relational Algebra
Authors:
Bahare Fatemi,
Perouz Taslakian,
David Vazquez,
David Poole
Abstract:
Embedding-based methods for reasoning in knowledge hypergraphs learn a representation for each entity and relation. Current methods do not capture the procedural rules underlying the relations in the graph. We propose a simple embedding-based model called ReAlE that performs link prediction in knowledge hypergraphs (generalized knowledge graphs) and can represent high-level abstractions in terms of relational algebra operations. We show theoretically that ReAlE is fully expressive and provide proofs and empirical evidence that it can represent a large subset of the primitive relational algebra operations, namely renaming, projection, set union, selection, and set difference. We also verify experimentally that ReAlE outperforms state-of-the-art models in knowledge hypergraph completion, and in representing each of these primitive relational algebra operations. For the latter experiment, we generate a synthetic knowledge hypergraph, for which we design an algorithm based on the Erdős-Rényi model for generating random graphs.
Submitted 18 February, 2021;
originally announced February 2021.
-
Knowledge Hypergraphs: Prediction Beyond Binary Relations
Authors:
Bahare Fatemi,
Perouz Taslakian,
David Vazquez,
David Poole
Abstract:
Knowledge graphs store facts using relations between two entities. In this work, we address the question of link prediction in knowledge hypergraphs where relations are defined on any number of entities. While techniques exist (such as reification) that convert non-binary relations into binary ones, we show that current embedding-based methods for knowledge graph completion do not work well out of the box for knowledge graphs obtained through these techniques. To overcome this, we introduce HSimplE and HypE, two embedding-based methods that work directly with knowledge hypergraphs. In both models, the prediction is a function of the relation embedding, the entity embeddings and their corresponding positions in the relation. We also develop public datasets, benchmarks and baselines for hypergraph prediction and show experimentally that the proposed models are more effective than the baselines.
Submitted 15 July, 2020; v1 submitted 31 May, 2019;
originally announced June 2019.
-
Context-Aware Visual Compatibility Prediction
Authors:
Guillem Cucurull,
Perouz Taslakian,
David Vazquez
Abstract:
How do we determine whether two or more clothing items are compatible or visually appealing? Part of the answer lies in an understanding of visual aesthetics, and is biased by personal preferences shaped by social attitudes, time, and place. In this work we propose a method that predicts compatibility between two items based on their visual features, as well as their context. We define context as the products that are known to be compatible with each of these items. Our model contrasts with other metric learning approaches that rely on pairwise comparisons between item features alone. We address the compatibility prediction problem using a graph neural network that learns to generate product embeddings conditioned on their context. We present results for two prediction tasks (fill in the blank and outfit compatibility) tested on two fashion datasets, Polyvore and Fashion-Gen, and on a subset of the Amazon dataset; we achieve state-of-the-art results when using context information and show how test performance improves as more context is used.
Submitted 12 February, 2019; v1 submitted 10 February, 2019;
originally announced February 2019.
-
Efficient Multi-Robot Coverage of a Known Environment
Authors:
Nare Karapetyan,
Kelly Benson,
Chris McKinney,
Perouz Taslakian,
Ioannis Rekleitis
Abstract:
This paper addresses the complete area coverage problem of a known environment by multiple robots. Complete area coverage is the problem of moving an end-effector over all available space while avoiding existing obstacles. In such tasks, using multiple robots can increase the efficiency of the area coverage in terms of minimizing the operational time and increase the robustness in the face of robot attrition. Unfortunately, the problem of finding an optimal solution for such an area coverage problem with multiple robots is known to be NP-complete. In this paper we present two approximation heuristics for solving the multi-robot coverage problem. The first solution presented is a direct extension of an efficient single-robot area coverage algorithm, based on an exact cellular decomposition. The second algorithm is a greedy approach that divides the area into equal regions and applies an efficient single-robot coverage algorithm to each region. We present experimental results for the two algorithms. Results indicate that our approaches provide good coverage distribution between robots and minimize the workload per robot, while ensuring complete coverage of the area.
Submitted 7 August, 2018;
originally announced August 2018.
-
Continuous Yao Graphs
Authors:
Luis Barba,
Prosenjit Bose,
Jean-Lou De Carufel,
Mirela Damian,
Rolf Fagerberg,
André van Renssen,
Perouz Taslakian,
Sander Verdonschot
Abstract:
In this paper, we introduce a variation of the well-studied Yao graphs. Given a set of points $S\subset \mathbb{R}^2$ and an angle $0 < θ\leq 2π$, we define the continuous Yao graph $cY(θ)$ with vertex set $S$ and angle $θ$ as follows. For each $p,q\in S$, we add an edge from $p$ to $q$ in $cY(θ)$ if there exists a cone with apex $p$ and aperture $θ$ such that $q$ is the closest point to $p$ inside this cone.
We study the spanning ratio of $cY(θ)$ for different values of $θ$. Using a new algebraic technique, we show that $cY(θ)$ is a spanner when $θ\leq 2π/3$. We believe that this technique may be of independent interest. We also show that $cY(π)$ is not a spanner, and that $cY(θ)$ may be disconnected for $θ> π$.
Submitted 18 August, 2014;
originally announced August 2014.
-
Theta-3 is connected
Authors:
Oswin Aichholzer,
Sang Won Bae,
Luis Barba,
Prosenjit Bose,
Matias Korman,
André van Renssen,
Perouz Taslakian,
Sander Verdonschot
Abstract:
In this paper, we show that the $θ$-graph with three cones is connected. We also provide an alternative proof of the connectivity of the Yao graph with three cones.
Submitted 28 April, 2014;
originally announced April 2014.
-
Fitting Voronoi Diagrams to Planar Tesselations
Authors:
Greg Aloupis,
Hebert Pérez-Rosés,
Guillermo Pineda-Villavicencio,
Perouz Taslakian,
Dannier Trinchet
Abstract:
Given a tesselation of the plane, defined by a planar straight-line graph $G$, we want to find a minimal set $S$ of points in the plane, such that the Voronoi diagram associated with $S$ "fits" $G$. This is the Generalized Inverse Voronoi Problem (GIVP), defined in \cite{Trin07} and rediscovered recently in \cite{Baner12}. Here we give an algorithm that solves this problem with a number of points that is linear in the size of $G$, assuming that the smallest angle in $G$ is constant.
Submitted 26 August, 2013;
originally announced August 2013.
-
New and Improved Spanning Ratios for Yao Graphs
Authors:
Luis Barba,
Prosenjit Bose,
Mirela Damian,
Rolf Fagerberg,
Wah Loon Keng,
Joseph O'Rourke,
André van Renssen,
Perouz Taslakian,
Sander Verdonschot,
Ge Xia
Abstract:
For a set of points in the plane and a fixed integer $k > 0$, the Yao graph $Y_k$ partitions the space around each point into $k$ equiangular cones of angle $θ=2π/k$, and connects each point to a nearest neighbor in each cone. It is known for all Yao graphs, with the sole exception of $Y_5$, whether or not they are geometric spanners. In this paper we close this gap by showing that for odd $k \geq 5$, the spanning ratio of $Y_k$ is at most $1/(1-2\sin(3θ/8))$, which gives the first constant upper bound for $Y_5$, and is an improvement over the previous bound of $1/(1-2\sin(θ/2))$ for odd $k \geq 7$. We further reduce the upper bound on the spanning ratio for $Y_5$ from $10.9$ to $2+\sqrt{3} \approx 3.74$, which falls slightly below the lower bound of $3.79$ established for the spanning ratio of $Θ_5$ ($Θ$-graphs differ from Yao graphs only in the way they select the closest neighbor in each cone). This is the first such separation between a Yao and $Θ$-graph with the same number of cones. We also give a lower bound of $2.87$ on the spanning ratio of $Y_5$. Finally, we revisit the $Y_6$ graph, which plays a particularly important role as the transition between the graphs ($k > 6$) for which simple inductive proofs are known, and the graphs ($k \le 6$) whose best spanning ratios have been established by complex arguments. Here we reduce the known spanning ratio of $Y_6$ from $17.6$ to $5.8$, getting closer to the spanning ratio of 2 established for $Θ_6$.
Submitted 14 March, 2019; v1 submitted 22 July, 2013;
originally announced July 2013.
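The Yao graph construction and the odd-$k$ bound quoted in the abstract above can be illustrated with a minimal sketch. The function names `yao_graph` and `odd_yao_bound` are hypothetical, and the cone orientation (first cone starting at the positive x-axis, half-open cones, ties broken by input order) is an assumption, since the abstract does not fix these details.

```python
import math

def yao_graph(points, k):
    """Directed Yao graph Y_k on a list of (x, y) points.

    Assumption: cone c covers angles [c*theta, (c+1)*theta) measured
    counter-clockwise from the positive x-axis; the abstract does not
    specify the orientation.
    """
    theta = 2 * math.pi / k
    edges = set()
    for i, (px, py) in enumerate(points):
        nearest = {}  # cone index -> (distance, neighbor index)
        for j, (qx, qy) in enumerate(points):
            if i == j:
                continue
            angle = math.atan2(qy - py, qx - px) % (2 * math.pi)
            # Clamp guards against floating-point rounding at 2*pi.
            cone = min(int(angle / theta), k - 1)
            dist = math.hypot(qx - px, qy - py)
            if cone not in nearest or dist < nearest[cone][0]:
                nearest[cone] = (dist, j)
        for _, j in nearest.values():
            edges.add((i, j))
    return edges

def odd_yao_bound(k):
    """Upper bound 1/(1 - 2 sin(3*theta/8)) from the abstract, theta = 2*pi/k."""
    theta = 2 * math.pi / k
    return 1 / (1 - 2 * math.sin(3 * theta / 8))

points = [(0, 0), (2, 1), (-1, 2), (-2, -1), (1, -2), (4, 2)]
edges = yao_graph(points, 4)
bound_y5 = odd_yao_bound(5)
```

For $k = 5$ the formula evaluates to roughly 10.9, which is consistent with the earlier $Y_5$ bound that the abstract says is reduced to $2+\sqrt{3}$.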
-
Cannibal Animal Games: a new variant of Tic-Tac-Toe
Authors:
Jean Cardinal,
Sébastien Collette,
Hiro Ito,
Matias Korman,
Stefan Langerman,
Hikaru Sakaidani,
Perouz Taslakian
Abstract:
This paper presents a new partial two-player game, called the \emph{cannibal animal game}, which is a variant of Tic-Tac-Toe. The game is played on the infinite grid, where in each round a player chooses and occupies free cells. The first player Alice can occupy a cell in each turn and wins if she occupies a set of cells, the union of a subset of which is a translated, reflected and/or rotated copy of a previously agreed upon polyomino $P$ (called an \emph{animal}). The objective of the second player Bob is to prevent Alice from creating her animal by occupying in each round a translated, reflected and/or rotated copy of $P$. An animal is a \emph{cannibal} if Bob has a winning strategy, and a \emph{non-cannibal} otherwise. This paper presents some new tools, such as the \emph{bounding strategy} and the \emph{punching lemma}, to classify animals into cannibals or non-cannibals. We also show that the \emph{pairing strategy} works for this problem.
Submitted 20 June, 2013;
originally announced June 2013.
-
Necklaces, Convolutions, and X+Y
Authors:
David Bremner,
Timothy M. Chan,
Erik D. Demaine,
Jeff Erickson,
Ferran Hurtado,
John Iacono,
Stefan Langerman,
Mihai Patrascu,
Perouz Taslakian
Abstract:
We give subquadratic algorithms that, given two necklaces each with n beads at arbitrary positions, compute the optimal rotation of the necklaces to best align the beads. Here alignment is measured according to the p norm of the vector of distances between pairs of beads from opposite necklaces in the best perfect matching. We show surprisingly different results for p = 1, p even, and p = \infty. For p even, we reduce the problem to standard convolution, while for p = \infty and p = 1, we reduce the problem to (min, +) convolution and (median, +) convolution. Then we solve the latter two convolution problems in subquadratic time, which are interesting results in their own right. These results shed some light on the classic sorting X + Y problem, because the convolutions can be viewed as computing order statistics on the antidiagonals of the X + Y matrix. All of our algorithms run in o(n^2) time, whereas the obvious algorithms for these problems run in Θ(n^2) time.
Submitted 19 December, 2012;
originally announced December 2012.
-
Coloring and Guarding Arrangements
Authors:
Prosenjit Bose,
Jean Cardinal,
Sébastien Collette,
Ferran Hurtado,
Matias Korman,
Stefan Langerman,
Perouz Taslakian
Abstract:
Given an arrangement of lines in the plane, what is the minimum number $c$ of colors required to color the lines so that no cell of the arrangement is monochromatic? In this paper we give bounds on the number $c$ both for the above question, as well as some of its variations. We redefine these problems as geometric hypergraph coloring problems. If we define $\Hlinecell$ as the hypergraph where vertices are lines and edges represent cells of the arrangement, the answer to the above question is equal to the chromatic number of this hypergraph. We prove that this chromatic number is between $Ω(\log n / \log\log n)$ and $O(\sqrt{n})$.
Similarly, we give bounds on the minimum size of a subset $S$ of the intersections of the lines in $\mathcal{A}$ such that every cell is bounded by at least one of the vertices in $S$. This may be seen as a problem on guarding cells with vertices when the lines act as obstacles. The problem can also be defined as the minimum vertex cover problem in the hypergraph $\Hvertexcell$, the vertices of which are the line intersections, and the hyperedges are vertices of a cell. Analogously, we consider the problem of touching the lines with a minimum subset of the cells of the arrangement, which we identify as the minimum vertex cover problem in the $\Hcellzone$ hypergraph.
Submitted 6 June, 2012; v1 submitted 23 May, 2012;
originally announced May 2012.
-
Colorful Strips
Authors:
G. Aloupis,
J. Cardinal,
S. Collette,
S. Imahori,
M. Korman,
S. Langerman,
O. Schwartz,
S. Smorodinsky,
P. Taslakian
Abstract:
Given a planar point set and an integer $k$, we wish to color the points with $k$ colors so that any axis-aligned strip containing enough points contains all colors. The goal is to bound the necessary size of such a strip, as a function of $k$. We show that if the strip size is at least $2k{-}1$, such a coloring can always be found. We prove that the size of the strip is also bounded in any fixed number of dimensions. In contrast to the planar case, we show that deciding whether a 3D point set can be 2-colored so that any strip containing at least three points contains both colors is NP-complete.
We also consider the problem of coloring a given set of axis-aligned strips, so that any sufficiently covered point in the plane is covered by $k$ colors. We show that in $d$ dimensions the required coverage is at most $d(k{-}1)+1$.
Lower bounds are given for the two problems. This complements recent impossibility results on decomposition of strip coverings with arbitrary orientations. Finally, we study a variant where strips are replaced by wedges.
Submitted 7 April, 2011; v1 submitted 14 April, 2009;
originally announced April 2009.
-
The Distance Geometry of Music
Authors:
Erik D. Demaine,
Francisco Gomez-Martin,
Henk Meijer,
David Rappaport,
Perouz Taslakian,
Godfried T. Toussaint,
Terry Winograd,
David R. Wood
Abstract:
We demonstrate relationships between the classic Euclidean algorithm and many other fields of study, particularly in the context of music and distance geometry. Specifically, we show how the structure of the Euclidean algorithm defines a family of rhythms which encompass over forty timelines (\emph{ostinatos}) from traditional world music. We prove that these \emph{Euclidean rhythms} have the mathematical property that their onset patterns are distributed as evenly as possible: they maximize the sum of the Euclidean distances between all pairs of onsets, viewing onsets as points on a circle. Indeed, Euclidean rhythms are the unique rhythms that maximize this notion of \emph{evenness}. We also show that essentially all Euclidean rhythms are \emph{deep}: each distinct distance between onsets occurs with a unique multiplicity, and these multiplicities form an interval $1,2,...,k-1$. Finally, we characterize all deep rhythms, showing that they form a subclass of generated rhythms, which in turn proves a useful property called shelling. All of our results for musical rhythms apply equally well to musical scales. In addition, many of the problems we explore are interesting in their own right as distance geometry problems on the circle; some of the same problems were explored by Erdős in the plane.
Submitted 28 May, 2007;
originally announced May 2007.
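The evenness and deepness properties described in the abstract above can be checked on a small case with the following sketch. It assumes the closed form "onset $i$ at $\lfloor i n / k \rfloor$" as a stand-in generator, which yields one rotation of an evenly spread pattern; it is not presented as the Euclidean-algorithm procedure from the paper, and all function names are hypothetical.

```python
import math
from collections import Counter

def euclidean_rhythm(k, n):
    """Positions of k onsets spread over n pulses (assumed closed form, one rotation)."""
    return [(i * n) // k for i in range(k)]

def evenness(onsets, n):
    """Sum of chord lengths between all onset pairs, onsets placed on a unit circle."""
    total = 0.0
    for a in range(len(onsets)):
        for b in range(a + 1, len(onsets)):
            arc = 2 * math.pi * (onsets[b] - onsets[a]) / n
            total += 2 * abs(math.sin(arc / 2))
    return total

def distance_multiplicities(onsets, n):
    """Count how often each circular distance occurs between onset pairs."""
    counts = Counter()
    for a in range(len(onsets)):
        for b in range(a + 1, len(onsets)):
            d = abs(onsets[b] - onsets[a])
            counts[min(d, n - d)] += 1
    return counts

rhythm = euclidean_rhythm(3, 8)            # onsets at pulses 0, 2, 5
mults = distance_multiplicities(rhythm, 8)  # distance 2 once, distance 3 twice
score = evenness(rhythm, 8)
```

With three onsets over eight pulses, the multiplicities are 1 and 2, matching the interval $1,\dots,k-1$ that the abstract gives for deep rhythms, and the chord-length sum exceeds that of clustered alternatives such as onsets at pulses 0, 1, 2.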