Search | arXiv e-print repository

RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models

Authors: Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone

Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationsh… ▽ More Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 9% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone. △ Less

Submitted 6 July, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

arXiv:2410.16227 [pdf, other]

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Authors: Alexander Krentsel, Peter Schafhalter, Joseph E. Gonzalez, Sylvia Ratnasamy, Scott Shenker, Ion Stoica

Abstract: Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocati… ▽ More Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car. △ Less

Submitted 21 October, 2024; originally announced October 2024.

Comments: 6 pages

arXiv:2403.02419 [pdf, other]

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou

Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we i… ▽ More Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we initiate the study of scaling properties of compound inference systems. We analyze, theoretically and empirically, how the number of LM calls affects the performance of Vote and Filter-Vote, two of the simplest compound system designs, which aggregate LM responses via majority voting, optionally applying LM filters. We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior can emerge when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LM calls that maximizes system performance, and define an analytical scaling model for both systems. Experiments show that our scaling model can accurately predict the performance of Vote and Filter-Vote systems and thus find the optimal number of LM calls to make. △ Less

Submitted 4 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2112.07238 [pdf, other]

Composing MPC with LQR and Neural Network for Amortized Efficiency and Stable Control

Authors: Fangyu Wu, Guanhua Wang, Siyuan Zhuang, Kehan Wang, Alexander Keimer, Ion Stoica, Alexandre Bayen

Abstract: Model predictive control (MPC) is a powerful control method that handles dynamical systems with constraints. However, solving MPC iteratively in real time, i.e., implicit MPC, remains a computational challenge. To address this, common solutions include explicit MPC and function approximation. Both methods, whenever applicable, may improve the computational efficiency of the implicit MPC by several… ▽ More Model predictive control (MPC) is a powerful control method that handles dynamical systems with constraints. However, solving MPC iteratively in real time, i.e., implicit MPC, remains a computational challenge. To address this, common solutions include explicit MPC and function approximation. Both methods, whenever applicable, may improve the computational efficiency of the implicit MPC by several orders of magnitude. Nevertheless, explicit MPC often requires expensive pre-computation and does not easily apply to higher-dimensional problems. Meanwhile, function approximation, although scales better with dimension, still requires pre-training on a large dataset and generally cannot guarantee to find an accurate surrogate policy, the failure of which often leads to closed-loop instability. To address these issues, we propose a triple-mode hybrid control scheme, named Memory-Augmented MPC, by combining a linear quadratic regulator, a neural network, and an MPC. From its standard form, we further derive two variants of such hybrid control scheme: one customized for chaotic systems and the other for slow systems. The proposed scheme does not require pre-computation and can improve the amortized running time of the composed MPC with a well-trained neural network. In addition, the scheme maintains closed-loop stability with any neural networks of proper input and output dimensions, alleviating the need for certifying optimality of the neural network in safety-critical applications. △ Less

Submitted 2 August, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: 13 pages, 10 figures, 2 tables

arXiv:1908.01275 [pdf, other]

A View on Deep Reinforcement Learning in System Optimization

Authors: Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Joseph Gonzalez, Krste Asanovic, Ion Stoica

Abstract: Many real-world systems problems require reasoning about the long term consequences of actions taken to configure and manage the system. These problems with delayed and often sequentially aggregated reward, are often inherently reinforcement learning problems and present the opportunity to leverage the recent substantial advances in deep reinforcement learning. However, in some cases, it is not cl… ▽ More Many real-world systems problems require reasoning about the long term consequences of actions taken to configure and manage the system. These problems with delayed and often sequentially aggregated reward, are often inherently reinforcement learning problems and present the opportunity to leverage the recent substantial advances in deep reinforcement learning. However, in some cases, it is not clear why deep reinforcement learning is a good fit for the problem. Sometimes, it does not perform better than the state-of-the-art solutions. And in other cases, random search or greedy algorithms could outperform deep reinforcement learning. In this paper, we review, discuss, and evaluate the recent trends of using deep reinforcement learning in system optimization. We propose a set of essential metrics to guide future works in evaluating the efficacy of using deep reinforcement learning in system optimization. Our evaluation includes challenges, the types of problems, their formulation in the deep reinforcement learning setting, embedding, the model used, efficiency, and robustness. We conclude with a discussion on open challenges and potential directions for pushing further the integration of reinforcement learning in system optimization. △ Less

Submitted 4 September, 2019; v1 submitted 4 August, 2019; originally announced August 2019.

Showing 1–5 of 5 results for author: Stoica, I