-
Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Authors:
Luke Rowe,
Rodrigue de Schaetzen,
Roger Girgis,
Christopher Pal,
Liam Paull
Abstract:
We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Poutine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using less than 500 preference-labeled frames from the Waymo validation set. We show that both VLT pretraining and RL fine-tuning are critical to attain strong driving performance in the long-tail. Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo's expert ground-truth RFS. The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These results highlight the promise of scalable VLT pre-training and lightweight RL fine-tuning to enable robust and generalizable autonomy.
Submitted 12 June, 2025;
originally announced June 2025.
-
VERDI: VLM-Embedded Reasoning for Autonomous Driving
Authors:
Bowen Feng,
Zhiting Mei,
Baiang Li,
Julian Ost,
Roger Girgis,
Anirudha Majumdar,
Felix Heide
Abstract:
While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance, while maintaining high inference speed.
Submitted 23 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments
Authors:
Luke Rowe,
Roger Girgis,
Anthony Gosselin,
Liam Paull,
Christopher Pal,
Felix Heide
Abstract:
We introduce Scenario Dreamer, a fully data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene - comprising a lane graph and agent bounding boxes - and closed-loop agent behaviours. Existing methods for generating driving simulation environments encode the initial traffic scene as a rasterized image and, as such, require parameter-heavy networks that perform unnecessary computation due to many empty pixels in the rasterized scene. Moreover, we find that existing methods that employ rule-based agent behaviours lack diversity and realism. Scenario Dreamer instead employs a novel vectorized latent diffusion model for initial scene generation that directly operates on the vectorized scene elements and an autoregressive Transformer for data-driven agent behaviour simulation. Scenario Dreamer additionally supports scene extrapolation via diffusion inpainting, enabling the generation of unbounded simulation environments. Extensive experiments show that Scenario Dreamer outperforms existing generative simulators in realism and efficiency: the vectorized scene-generation base model achieves superior generation quality with around 2x fewer parameters, 6x lower generation latency, and 10x fewer GPU training hours compared to the strongest baseline. We confirm its practical utility by showing that reinforcement learning planning agents are more challenged in Scenario Dreamer environments than traditional non-generative simulation environments, especially on long and adversarial driving environments.
Submitted 28 March, 2025;
originally announced March 2025.
-
CtRL-Sim: Reactive and Controllable Driving Agents with Offline Reinforcement Learning
Authors:
Luke Rowe,
Roger Girgis,
Anthony Gosselin,
Bruno Carrez,
Florian Golemo,
Felix Heide,
Liam Paull,
Christopher Pal
Abstract:
Evaluating autonomous vehicle stacks (AVs) in simulation typically involves replaying driving logs from real-world recorded traffic. However, agents replayed from offline data are not reactive and hard to intuitively control. Existing approaches address these challenges by proposing methods that rely on heuristics or generative models of real-world data but these approaches either lack realism or necessitate costly iterative sampling procedures to control the generated behaviours. In this work, we take an alternative approach and propose CtRL-Sim, a method that leverages return-conditioned offline reinforcement learning (RL) to efficiently generate reactive and controllable traffic agents. Specifically, we process real-world driving data through a physics-enhanced Nocturne simulator to generate a diverse offline RL dataset, annotated with various rewards. With this dataset, we train a return-conditioned multi-agent behaviour model that allows for fine-grained manipulation of agent behaviours by modifying the desired returns for the various reward components. This capability enables the generation of a wide range of driving behaviours beyond the scope of the initial dataset, including adversarial behaviours. We show that CtRL-Sim can generate realistic safety-critical scenarios while providing fine-grained control over agent behaviours.
Submitted 14 October, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Direct Behavior Specification via Constrained Reinforcement Learning
Authors:
Julien Roy,
Roger Girgis,
Joshua Romoff,
Pierre-Luc Bacon,
Christopher Pal
Abstract:
The standard formulation of Reinforcement Learning lacks a practical way of specifying what are admissible and forbidden behaviors. Most often, practitioners go about the task of behavior specification by manually engineering the reward function, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has almost exclusively been used for safe RL, also has the potential to significantly reduce the amount of work spent for reward specification in applied RL projects. To this end, we propose to specify behavioral preferences in the CMDP framework and to use Lagrangian methods to automatically weigh each of these behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to several constraints simultaneously. We evaluate this framework on a set of continuous control tasks relevant to the application of Reinforcement Learning for NPC design in video games.
Submitted 18 June, 2022; v1 submitted 22 December, 2021;
originally announced December 2021.
-
Latent Variable Sequential Set Transformers For Joint Multi-Agent Motion Prediction
Authors:
Roger Girgis,
Florian Golemo,
Felipe Codevilla,
Martin Weiss,
Jim Aldon D'Souza,
Samira Ebrahimi Kahou,
Felix Heide,
Christopher Pal
Abstract:
Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A major challenge is to efficiently learn a representation that approximates the true joint distribution of contextual, social, and temporal information to enable planning. We propose Latent Variable Sequential Set Transformers which are encoder-decoder architectures that generate scene-consistent multi-agent trajectories. We refer to these architectures as "AutoBots". The encoder is a stack of interleaved temporal and social multi-head self-attention (MHSA) modules which alternately perform equivariant processing across the temporal and social dimensions. The decoder employs learnable seed parameters in combination with temporal and social MHSA modules allowing it to perform inference over the entire future scene in a single forward pass efficiently. AutoBots can produce either the trajectory of one ego-agent or a distribution over the future trajectories for all agents in the scene. For the single-agent prediction case, our model achieves top results on the global nuScenes vehicle motion prediction leaderboard, and produces strong results on the Argoverse vehicle prediction challenge. In the multi-agent setting, we evaluate on the synthetic partition of TrajNet++ dataset to showcase the model's socially-consistent predictions. We also demonstrate our model on general sequences of sets and provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. A distinguishing feature of AutoBots is that all models are trainable on a single desktop GPU (1080 Ti) in under 48h.
Submitted 10 February, 2022; v1 submitted 19 February, 2021;
originally announced April 2021.
-
Navigation Agents for the Visually Impaired: A Sidewalk Simulator and Experiments
Authors:
Martin Weiss,
Simon Chamorro,
Roger Girgis,
Margaux Luck,
Samira E. Kahou,
Joseph P. Cohen,
Derek Nowrouzezahrai,
Doina Precup,
Florian Golemo,
Chris Pal
Abstract:
Millions of blind and visually-impaired (BVI) people navigate urban environments every day, using smartphones for high-level path-planning and white canes or guide dogs for local information. However, many BVI people still struggle to travel to new places. In our endeavor to create a navigation assistant for the BVI, we found that existing Reinforcement Learning (RL) environments were unsuitable for the task. This work introduces SEVN, a sidewalk simulation environment and a neural network-based approach to creating a navigation agent. SEVN contains panoramic images with labels for house numbers, doors, and street name signs, and formulations for several navigation tasks. We study the performance of an RL algorithm (PPO) in this setting. Our policy model fuses multi-modal observations in the form of variable resolution images, visible text, and simulated GPS data to navigate to a goal door. We hope that this dataset, simulator, and experimental results will provide a foundation for further research into the creation of agents that can assist members of the BVI community with outdoor navigation.
Submitted 29 October, 2019;
originally announced October 2019.
-
A Survey of Mobile Computing for the Visually Impaired
Authors:
Martin Weiss,
Margaux Luck,
Roger Girgis,
Chris Pal,
Joseph Paul Cohen
Abstract:
The number of visually impaired or blind (VIB) people in the world is estimated at several hundred million. Based on a series of interviews with the VIB and developers of assistive technology, this paper provides a survey of machine-learning based mobile applications and identifies the most relevant applications. We discuss the functionality of these apps, how they align with the needs and requirements of the VIB users, and how they can be improved with techniques such as federated learning and model compression. As a result of this study we identify promising future directions of research in mobile perception, micro-navigation, and content-summarization.
Submitted 27 November, 2018; v1 submitted 25 November, 2018;
originally announced November 2018.
-
Performance evaluation of a new route optimization technique for mobile IP
Authors:
Moheb R Girgis,
Tarek M Mahmoud,
Youssef S Takroni,
Hassan S Hassan
Abstract:
Mobile IP (MIP) is an Internet protocol that allows mobile nodes to maintain continuous network connectivity to the Internet without changing their IP addresses while moving to other networks. Packets sent from a correspondent node (CN) to a mobile node (MN) first go through the mobile node's home agent (HA), which then tunnels them to the MN's foreign network. One of the main problems in the original MIP is the triangle routing problem, which appears when the indirect path between the CN and the MN through the HA is longer than the direct path. This paper proposes a new technique to improve the performance of the original MIP during handoff. The proposed technique reduces the delay, the packet loss, and the registration time for all packets transferred between the CN and the MN. In this technique, tunneling occurs at two levels above the HA in a hierarchical network. To show the effectiveness of the proposed technique, it is compared with the original MIP and with another technique for solving the same problem in which tunneling occurs at one level above the HA. The simulation results presented in this paper are based on the ns2 mobility software on a Linux platform. They show that the proposed technique achieves better performance than the others in terms of packet delay, packet loss during handoffs, and registration time, in different scenarios for the location of the MN with respect to the HA and the foreign agents (FAs).
Submitted 6 April, 2010;
originally announced April 2010.