Showing 1–50 of 178 results for author: Stoica, I

Searching in archive cs.
  1. arXiv:2507.02825  [pdf, ps, other]

    cs.AI

    Establishing Best Practices for Building Rigorous Agentic Benchmarks

    Authors: Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

    Abstract: Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in tas…

    Submitted 10 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Comments: 39 pages, 15 tables, 6 figures

    ACM Class: A.1; I.2.m

  2. arXiv:2506.19852  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

    Authors: Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han

    Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal d…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/mit-han-lab/radial-attention

  3. arXiv:2506.17811  [pdf, ps, other]

    cs.RO cs.AI eess.SY

    RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models

    Authors: Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone

    Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationsh…

    Submitted 6 July, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

  4. arXiv:2506.15733  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts

    Authors: Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun

    Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: 28 pages, 6 figures, 2 tables

  5. arXiv:2506.08276  [pdf, ps, other]

    cs.DB cs.LG

    LEANN: A Low-Storage Vector Index

    Authors: Yichuan Wang, Shu Liu, Zhifei Li, Yongji Wu, Ziming Mao, Yilong Zhao, Xiao Yan, Zhiying Xu, Yang Zhou, Ion Stoica, Sewon Min, Matei Zaharia, Joseph E. Gonzalez

    Abstract: Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing…

    Submitted 9 June, 2025; originally announced June 2025.

  6. arXiv:2505.24095  [pdf, ps, other]

    cs.DC

    SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference

    Authors: Tian Xia, Ziming Mao, Jamison Kerney, Ethan J. Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, Ion Stoica

    Abstract: Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, like reserved instances or on-premise clusters, which are often underutilized due to their region-local traffic handling and diurnal traffic variance. In this paper, we…

    Submitted 29 May, 2025; originally announced May 2025.

  7. arXiv:2505.23671  [pdf, ps, other]

    cs.SE cs.AI cs.CL cs.LG

    GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

    Authors: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

    Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanni…

    Submitted 30 May, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: Website: https://gso-bench.github.io/

  8. arXiv:2505.18875  [pdf, ps, other]

    cs.CV

    Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

    Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica

    Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for tw…

    Submitted 24 May, 2025; originally announced May 2025.

  9. arXiv:2505.15146  [pdf, ps, other]

    cs.AI

    lmgame-Bench: How Good are LLMs at Playing Games?

    Authors: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang

    Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential…

    Submitted 3 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  10. arXiv:2505.13389  [pdf, ps, other]

    cs.CV

    VSA: Faster Video Diffusion with Trainable Sparse Attention

    Authors: Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

    Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies hi…

    Submitted 26 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  11. arXiv:2505.07203  [pdf, other]

    cs.DC

    PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

    Authors: Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, Junchen Jiang

    Abstract: Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labeling. The key characteristic of these emerging use cases is that the LLM generates only a single output token, rather than an arbitrarily long sequence of tokens…

    Submitted 11 May, 2025; originally announced May 2025.

  12. arXiv:2505.04021  [pdf, other]

    cs.DC cs.AI cs.LG cs.PF

    Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

    Authors: Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, Ying Sheng

    Abstract: Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. Howeve…

    Submitted 12 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  13. arXiv:2504.17307  [pdf, other]

    cs.NI

    An Extensible Software Transport Layer for GPU Networking

    Authors: Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, Fengyuan Ren, Zhiying Xu, Costin Raiciu, Ion Stoica

    Abstract: Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UC…

    Submitted 24 April, 2025; originally announced April 2025.

  14. arXiv:2504.16324  [pdf, other]

    cs.DC cs.AR

    The Dawn of Disaggregation and the Coherence Conundrum: A Call for Federated Coherence

    Authors: Jaewan Hong, Marcos K. Aguilera, Emmanuel Amaro, Vincent Liu, Aurojit Panda, Ion Stoica

    Abstract: Disaggregated memory is an upcoming data center technology that will allow nodes (servers) to share data efficiently. Sharing data creates a debate on the level of cache coherence the system should provide. While current proposals aim to provide coherence for all or parts of the disaggregated memory, we argue that this approach is problematic, because of scalability limitations and hardware comple…

    Submitted 22 April, 2025; originally announced April 2025.

  15. arXiv:2504.13171  [pdf, other]

    cs.AI cs.CL

    Sleep-time Compute: Beyond Inference Scaling at Test-time

    Authors: Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez

    Abstract: Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly red…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Code and data released at: https://github.com/letta-ai/sleep-time-compute

  16. arXiv:2504.07164  [pdf, other]

    cs.SE cs.CL cs.LG

    R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents

    Authors: Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, Ion Stoica

    Abstract: Improving open-source models on real-world SWE tasks (solving GitHub issues) faces two key challenges: 1) scalable curation of execution environments to train these models, and, 2) optimal scaling of test-time compute. We introduce AgentGym, the largest procedurally-curated executable gym environment for training real-world SWE-agents, consisting of more than 8.7K tasks. AgentGym is powered by two…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: Website: https://r2e-gym.github.io/

  17. arXiv:2504.03871  [pdf, other]

    cs.DC cs.LG

    HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

    Authors: Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

    Abstract: The Mixture-of-Experts (MoE) architecture has become increasingly popular as a method to scale up large language models (LLMs). To save costs, heterogeneity-aware training solutions have been proposed to utilize GPU clusters made up of both newer and older-generation GPUs. However, existing solutions are agnostic to the performance characteristics of different MoE model components (i.e., attention…

    Submitted 4 April, 2025; originally announced April 2025.

  18. arXiv:2503.20127  [pdf, other]

    cs.RO cs.NI

    Bandwidth Allocation for Cloud-Augmented Autonomous Driving

    Authors: Peter Schafhalter, Alexander Krentsel, Joseph E. Gonzalez, Sylvia Ratnasamy, Scott Shenker, Ion Stoica

    Abstract: Autonomous vehicle (AV) control systems increasingly rely on ML models for tasks such as perception and planning. Current practice is to run these models on the car's local hardware due to real-time latency constraints and reliability concerns, which limits model size and thus accuracy. Prior work has observed that we could augment current systems by running larger models in the cloud, relying on…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 18 pages, 11 figures

  19. arXiv:2503.18292  [pdf, other]

    cs.DC

    Jenga: Effective Memory Management for Serving LLM with Heterogeneity

    Authors: Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica

    Abstract: Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention, and access patterns…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: 16 pages, 19 figures

  20. arXiv:2503.13657  [pdf, other]

    cs.AI

    Why Do Multi-Agent LLM Systems Fail?

    Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

    Abstract: Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks often remain minimal compared with single-agent frameworks. This gap highlights the need to systematically analyze the challenges hindering MAS effectiveness. We present MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy designed to understand MAS failures.…

    Submitted 22 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: ArXiv v2

  21. arXiv:2502.20818  [pdf, other]

    cs.DC

    SkyStore: Cost-Optimized Object Storage Across Regions and Clouds

    Authors: Shu Liu, Xiangxi Mo, Moshik Hershcovitch, Henric Zhang, Audrey Cheng, Guy Girmonsky, Gil Vernik, Michael Factor, Tiemo Bang, Soujanya Ponnapalli, Natacha Crooks, Joseph E. Gonzalez, Danny Harnik, Ion Stoica

    Abstract: Modern applications span multiple clouds to reduce costs, avoid vendor lock-in, and leverage low-availability resources in another cloud. However, standard object stores operate within a single cloud, forcing users to manually manage data placement across clouds, i.e., navigate their diverse APIs and handle heterogeneous costs for network and storage. This is often a complex choice: users must eit…

    Submitted 28 February, 2025; originally announced February 2025.

  22. arXiv:2502.20694  [pdf, other]

    cs.CV cs.AI

    WorldModelBench: Judging Video Generation Models As World Models

    Authors: Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, Yao Lu

    Abstract: Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality, ignoring important factors to world models such as physics adherence. To bridge this gap, we propose WorldM…

    Submitted 27 February, 2025; originally announced February 2025.

  23. arXiv:2502.14855  [pdf, other]

    cs.LG cs.CL

    Prompt-to-Leaderboard

    Authors: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica

    Abstract: Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural languag…

    Submitted 10 March, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  24. arXiv:2502.14815  [pdf, other]

    cs.AI cs.CL cs.LG cs.MA

    Optimizing Model Selection for Compound AI Systems

    Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica

    Abstract: Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSe…

    Submitted 20 February, 2025; originally announced February 2025.

  25. arXiv:2502.14382  [pdf, other]

    cs.LG cs.AI

    S*: Test Time Scaling for Code Generation

    Authors: Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

    Abstract: Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance bo…

    Submitted 20 February, 2025; originally announced February 2025.

  26. arXiv:2502.13965  [pdf, other]

    cs.LG cs.AI cs.DC

    Autellix: An Efficient Serving Engine for LLM Agents as General Programs

    Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica

    Abstract: Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs sub…

    Submitted 19 February, 2025; originally announced February 2025.

  27. arXiv:2502.09328  [pdf, other]

    cs.SE

    Copilot Arena: A Platform for Code LLM Evaluation in the Wild

    Authors: Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

    Abstract: Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce l…

    Submitted 13 February, 2025; originally announced February 2025.

  28. arXiv:2502.08235  [pdf, other]

    cs.AI

    The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

    Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez

    Abstract: Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observ…

    Submitted 12 February, 2025; originally announced February 2025.

  29. arXiv:2502.07374  [pdf, other]

    cs.AI

    LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!

    Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

    Abstract: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient super…

    Submitted 18 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  30. arXiv:2502.06155  [pdf, other]

    cs.CV

    Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

    Authors: Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang

    Abstract: Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects:…

    Submitted 17 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  31. arXiv:2502.04507  [pdf, ps, other]

    cs.CV

    Fast Video Generation with Sliding Tile Attention

    Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang

    Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained vide…

    Submitted 4 June, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted by ICML 2025

  32. arXiv:2502.02770  [pdf, other]

    cs.LG cs.CL

    Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

    Authors: Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

    Abstract: Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between…

    Submitted 5 February, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

  33. arXiv:2502.01776  [pdf, other]

    cs.CV cs.LG

    Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

    Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han

    Abstract: Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a trai…

    Submitted 26 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: 17 pages, 11 figures, 3 tables

  34. arXiv:2502.01697  [pdf, other]

    cs.CL cs.AI cs.LG

    BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation

    Authors: Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia

    Abstract: As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. However, current data generation methods rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models. This reliance can be especially problematic when the curation of high-quality examples is expensive or diffic…

    Submitted 21 May, 2025; v1 submitted 2 February, 2025; originally announced February 2025.

  35. arXiv:2501.14312  [pdf, other]

    cs.DC cs.LG

    Locality-aware Fair Scheduling in LLM Serving

    Authors: Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica

    Abstract: Large language model (LLM) inference workload dominates a wide variety of modern AI applications, ranging from multi-turn conversation to document analysis. Balancing fairness and efficiency is critical for managing diverse client workloads with varying prefix patterns. Unfortunately, existing fair scheduling algorithms for LLM serving, such as Virtual Token Counter (VTC), fail to take prefix loca…

    Submitted 24 January, 2025; originally announced January 2025.

  36. arXiv:2501.12407  [pdf, other]

    cs.DC cs.LG

    The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

    Authors: Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang

    Abstract: While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the str…

    Submitted 16 February, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

  37. arXiv:2501.07493  [pdf, other]

    cs.LG cs.CR

    Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

    Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

    Abstract: It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was respo…

    Submitted 13 January, 2025; originally announced January 2025.

  38. arXiv:2412.20993  [pdf, other]

    cs.LG cs.CL

    Efficiently Scaling LLM Reasoning with Certaindex

    Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, Hao Zhang

    Abstract: Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final ans…

    Submitted 27 May, 2025; v1 submitted 30 December, 2024; originally announced December 2024.

  39. arXiv:2412.20221  [pdf, other]

    cs.OS cs.DC cs.NI

    Revisiting Cache Freshness for Emerging Real-Time Applications

    Authors: Ziming Mao, Rishabh Iyer, Scott Shenker, Ion Stoica

    Abstract: Caching is widely used in industry to improve application performance by reducing data-access latency and taking the load off the backend infrastructure. TTLs have become the de-facto mechanism used to keep cached data reasonably fresh (i.e., not too out of date with the backend). However, the emergence of real-time applications requires tighter data freshness, which is impractical to achieve with…

    Submitted 28 December, 2024; originally announced December 2024.

    Comments: HotNets '24

  40. arXiv:2412.18407  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    A Statistical Framework for Ranking LLM-Based Chatbots

    Authors: Siavash Ameli, Siyuan Zhuang, Ion Stoica, Michael W. Mahoney

    Abstract: Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this…

    Submitted 29 May, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

    Journal ref: The Thirteenth International Conference on Learning Representations (2025)

  41. arXiv:2412.14468  [pdf, ps, other]

    cs.LG cs.AI

    HashAttention: Semantic Sparsity for Faster Inference

    Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

    Abstract: Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e. only a few pivotal tokens significantly contribute to output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. W…

    Submitted 3 June, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

    Comments: Accepted at ICML'2025

  42. arXiv:2412.08687  [pdf, other]

    cs.CV

    VisionArena: 230K Real World User-VLM Conversations with Preference Labels

    Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang

    Abstract: With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena - an open-source platform where users interact with VLMs and submit preference votes - VisionArena spans 73K unique…

    Submitted 25 March, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

    Comments: updated for CVPR Camera Ready

  43. arXiv:2412.06394  [pdf, other]

    cs.AI cs.CL

    GameArena: Evaluating LLM Reasoning through Live Computer Games

    Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

    Abstract: Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that conflates reasoning with other abilities. As the most prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in real-world settings…

    Submitted 15 February, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  44. arXiv:2412.05408  [pdf, other]

    cs.RO cs.AI cs.DC cs.NI

    FogROS2-FT: Fault Tolerant Cloud Robotics

    Authors: Kaiyuan Chen, Kush Hari, Trinity Chung, Michael Wang, Nan Tian, Christian Juette, Jeffrey Ichnowski, Liu Ren, John Kubiatowicz, Ion Stoica, Ken Goldberg

    Abstract: Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and cloud can be prone to variations in network Quality-of-Service (QoS). We present FogROS2-FT (Fault Tolerant) to mitigate these issues by introducing a…

    Submitted 6 December, 2024; originally announced December 2024.

    Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems 2024 Best Paper Finalist

  45. arXiv:2412.05299  [pdf, other]

    cs.SE cs.AI cs.CL

    Specifications: The missing link to making the development of LLM systems an engineering discipline

    Authors: Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Koushik Sen, Hao Zhang, Anastasios Angelopoulos, Shishir G. Patil, Lingjiao Chen, Wei-Lin Chiang, Jared Q. Davis

    Abstract: Despite the significant strides made by generative AI in just a few short years, its future progress is constrained by the challenge of building modular and robust systems. This capability has been a cornerstone of past technological revolutions, which relied on combining components to create increasingly sophisticated and reliable systems. Cars, airplanes, computers, and software consist of compo…

    Submitted 16 December, 2024; v1 submitted 25 November, 2024; originally announced December 2024.

  46. arXiv:2411.16102  [pdf, other]

    cs.LG

    BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

    Authors: Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

    Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a re…

    Submitted 25 November, 2024; originally announced November 2024.

  47. arXiv:2411.11217  [pdf, other]

    cs.DC cs.AI cs.LG

    MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

    Authors: Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica

    Abstract: Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compa…

    Submitted 17 November, 2024; originally announced November 2024.

  48. arXiv:2411.09317  [pdf, other]

    cs.LG cs.DC

    Pie: Pooling CPU Memory for LLM Inference

    Authors: Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica

    Abstract: The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challeng…

    Submitted 14 November, 2024; originally announced November 2024.

  49. SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

    Authors: Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica

    Abstract: Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements, make it timely and challenging to lower service costs and guarantee service quality. While spot instances have long been offered with a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI mode…

    Submitted 3 March, 2025; v1 submitted 3 November, 2024; originally announced November 2024.

    Comments: EuroSys '25

  50. arXiv:2411.01142  [pdf, other]

    cs.DC cs.AI cs.LG

    NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

    Authors: Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu

    Abstract: Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute r…

    Submitted 2 November, 2024; originally announced November 2024.