-
Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences
Authors:
Stas Bekman,
Samyam Rajbhandari,
Michael Wyatt,
Jeff Rasley,
Tunji Ruwase,
Zhewei Yao,
Aurick Qiao,
Yuxiong He
Abstract:
Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space.
Out-of-box, even on a modern NVIDIA H100 80GB GPU…
▽ More
Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space.
Out-of-box, even on a modern NVIDIA H100 80GB GPU cluster, training Llama 8B model with sequence over 32K runs out of memory on a basic Hugging Face (HF) model due to two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU memory, ii) existing solutions for leveraging multiple GPU memory are not easily available to HF models, making long sequence training inaccessible.
We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory optimizations, that enables it to support out-of-box training of multi-million sequence length for a wide variety of HF models.
ALST supports training Meta's Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
A 3D pocket-aware and evolutionary conserved interaction guided diffusion model for molecular optimization
Authors:
Anjie Qiao,
Hao Zhang,
Qianmu Yuan,
Qirui Deng,
Jingtian Su,
Weifeng Huang,
Huihao Zhou,
Guo-Bo Li,
Zhen Wang,
Jinping Lei
Abstract:
Generating molecules that bind to specific protein targets via diffusion models has shown good promise for structure-based drug design and molecule optimization. Especially, the diffusion models with binding interaction guidance enables molecule generation with high affinity through forming favorable interaction within protein pocket. However, the generated molecules may not form interactions with…
▽ More
Generating molecules that bind to specific protein targets via diffusion models has shown good promise for structure-based drug design and molecule optimization. Especially, the diffusion models with binding interaction guidance enables molecule generation with high affinity through forming favorable interaction within protein pocket. However, the generated molecules may not form interactions with the highly conserved residues, which are important for protein functions and bioactivities of the ligands. Herein, we developed a new 3D target-aware diffusion model DiffDecip, which explicitly incorporates the protein-ligand binding interactions and evolutionary conservation information of protein residues into both diffusion and sampling process, for molecule optimization through scaffold decoration. The model performance revealed that DiffDecip outperforms baseline model DiffDec on molecule optimization towards higher affinity through forming more non-covalent interactions with highly conserved residues in the protein pocket.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
A 3D pocket-aware and affinity-guided diffusion model for lead optimization
Authors:
Anjie Qiao,
Junjie Xie,
Weifeng Huang,
Hao Zhang,
Jiahua Rao,
Shuangjia Zheng,
Yuedong Yang,
Zhen Wang,
Guo-Bo Li,
Jinping Lei
Abstract:
Molecular optimization, aimed at improving binding affinity or other molecular properties, is a crucial task in drug discovery that often relies on the expertise of medicinal chemists. Recently, deep learning-based 3D generative models showed promise in enhancing the efficiency of molecular optimization. However, these models often struggle to adequately consider binding affinities with protein ta…
▽ More
Molecular optimization, aimed at improving binding affinity or other molecular properties, is a crucial task in drug discovery that often relies on the expertise of medicinal chemists. Recently, deep learning-based 3D generative models showed promise in enhancing the efficiency of molecular optimization. However, these models often struggle to adequately consider binding affinities with protein targets during lead optimization. Herein, we propose a 3D pocket-aware and affinity-guided diffusion model, named Diffleop, to optimize molecules with enhanced binding affinity. The model explicitly incorporates the knowledge of protein-ligand binding affinity to guide the denoising sampling for molecule generation with high affinity. The comprehensive evaluations indicated that Diffleop outperforms baseline models across multiple metrics, especially in terms of binding affinity.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Efficiently Scaling LLM Reasoning with Certaindex
Authors:
Yichao Fu,
Junda Chen,
Siqi Zhu,
Zheyu Fu,
Zhongdongming Dai,
Yonghao Zhuang,
Yian Ma,
Aurick Qiao,
Tajana Rosing,
Ion Stoica,
Hao Zhang
Abstract:
Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final ans…
▽ More
Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final answer. To quantify this phenomenon, we introduce Certaindex, an algorithm-agnostic metric measuring this evolving stability, signaling when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and further enables dynamic token allocation, gang scheduling, and many opportunities when integrated with real-world LLM serving systems. To quantify real-world benefits, we built Certaindex as a scheduler into Dynasor, our reasoning-aware LLM serving system, and demonstrate up to 50% compute savings and 3.3x higher throughput in real workloads with no accuracy drop. Our code is available at https://github.com/hao-ai-lab/Dynasor.git
△ Less
Submitted 27 May, 2025; v1 submitted 30 December, 2024;
originally announced December 2024.
-
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
Authors:
Gabriele Oliaro,
Zhihao Jia,
Daniel Campos,
Aurick Qiao
Abstract:
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agen…
▽ More
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3$\times$, outperforming state-of-the-art methods -- 2.8$\times$ faster than model-based approaches like EAGLE-2/3 and 1.9$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference.
△ Less
Submitted 2 June, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Authors:
Aurick Qiao,
Zhewei Yao,
Samyam Rajbhandari,
Yuxiong He
Abstract:
LLM inference for enterprise applications, such as summarization, RAG, and code-generation, typically observe much longer prompt than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV…
▽ More
LLM inference for enterprise applications, such as summarization, RAG, and code-generation, typically observe much longer prompt than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can effectively reduce prefill computation by 25-50% across several LLM families while incurring minimum quality degradation. In the end-to-end inference serving, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B. SwiftKV is open-sourced at https://github.com/snowflakedb/arctictraining.
△ Less
Submitted 1 June, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
Authors:
Jaeseong Lee,
seung-won hwang,
Aurick Qiao,
Daniel F Campos,
Zhewei Yao,
Yuxiong He
Abstract:
Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in Large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this, by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance fo…
▽ More
Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in Large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this, by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts, cannot scale for recent MoEs, we propose a scalable alternative with $O(1)$ complexity, yet outperforming the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Ours is highly effective -- for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails to. The code will be made publicly available.
△ Less
Submitted 10 September, 2024;
originally announced September 2024.
-
Efficient LLM Scheduling by Learning to Rank
Authors:
Yichao Fu,
Siqi Zhu,
Runlong Su,
Aurick Qiao,
Ion Stoica,
Hao Zhang
Abstract:
In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exac…
▽ More
In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
Nurgle: Exacerbating Resource Consumption in Blockchain State Storage via MPT Manipulation
Authors:
Zheyuan He,
Zihao Li,
Ao Qiao,
Xiapu Luo,
Xiaosong Zhang,
Ting Chen,
Shuwei Song,
Dijun Liu,
Weina Niu
Abstract:
Blockchains, with intricate architectures, encompass various components, e.g., consensus network, smart contracts, decentralized applications, and auxiliary services. While offering numerous advantages, these components expose various attack surfaces, leading to severe threats to blockchains. In this study, we unveil a novel attack surface, i.e., the state storage, in blockchains. The state storag…
▽ More
Blockchains, with intricate architectures, encompass various components, e.g., consensus network, smart contracts, decentralized applications, and auxiliary services. While offering numerous advantages, these components expose various attack surfaces, leading to severe threats to blockchains. In this study, we unveil a novel attack surface, i.e., the state storage, in blockchains. The state storage, based on the Merkle Patricia Trie, plays a crucial role in maintaining blockchain state. Besides, we design Nurgle, the first Denial-of-Service attack targeting the state storage. By proliferating intermediate nodes within the state storage, Nurgle forces blockchains to expend additional resources on state maintenance and verification, impairing their performance. We conduct a comprehensive and systematic evaluation of Nurgle, including the factors affecting it, its impact on blockchains, its financial cost, and practically demonstrating the resulting damage to blockchains. The implications of Nurgle extend beyond the performance degradation of blockchains, potentially reducing trust in them and the value of their cryptocurrencies. Additionally, we further discuss three feasible mitigations against Nurgle. At the time of writing, the vulnerability exploited by Nurgle has been confirmed by six mainstream blockchains, and we received thousands of USD bounty from them.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
Large Language Models for Blockchain Security: A Systematic Literature Review
Authors:
Zheyuan He,
Zihao Li,
Sen Yang,
He Ye,
Ao Qiao,
Xiaosong Zhang,
Xiapu Luo,
Ting Chen
Abstract:
Large Language Models (LLMs) have emerged as powerful tools across various domains within cyber security. Notably, recent studies are increasingly exploring LLMs applied to the context of blockchain security (BS). However, there remains a gap in a comprehensive understanding regarding the full scope of applications, impacts, and potential constraints of LLMs on blockchain security. To fill this ga…
▽ More
Large Language Models (LLMs) have emerged as powerful tools across various domains within cyber security. Notably, recent studies are increasingly exploring LLMs applied to the context of blockchain security (BS). However, there remains a gap in a comprehensive understanding regarding the full scope of applications, impacts, and potential constraints of LLMs on blockchain security. To fill this gap, we undertake a literature review focusing on the studies that apply LLMs in blockchain security (LLM4BS).
Our study aims to comprehensively analyze and understand existing research, and elucidate how LLMs contribute to enhancing the security of blockchain systems. Through a thorough examination of existing literature, we delve into the integration of LLMs into various aspects of blockchain security. We explore the mechanisms through which LLMs can bolster blockchain security, including their applications in smart contract auditing, transaction anomaly detection, vulnerability repair, program analysis of smart contracts, and serving as participants in the cryptocurrency community. Furthermore, we assess the challenges and limitations associated with leveraging LLMs for enhancing blockchain security, considering factors such as scalability, privacy concerns, and ethical concerns. Our thorough review sheds light on the opportunities and potential risks of tasks on LLM4BS, providing valuable insights for researchers, practitioners, and policymakers alike.
△ Less
Submitted 24 March, 2025; v1 submitted 21 March, 2024;
originally announced March 2024.
-
LLM360: Towards Fully Transparent Open-Source LLMs
Authors:
Zhengzhong Liu,
Aurick Qiao,
Willie Neiswanger,
Hongyi Wang,
Bowen Tan,
Tianhua Tao,
Junbo Li,
Yuqi Wang,
Suqi Sun,
Omkar Pangarkar,
Richard Fan,
Yi Gu,
Victor Miller,
Yonghao Zhuang,
Guowei He,
Haonan Li,
Fajri Koto,
Liping Tang,
Nikhil Ranjan,
Zhiqiang Shen,
Xuguang Ren,
Roberto Iriondo,
Cun Mu,
Zhiting Hu,
Mark Schulze
, et al. (3 additional authors not shown)
Abstract:
The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder prog…
▽ More
The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Authors:
Aurick Qiao,
Sang Keun Choe,
Suhas Jayaram Subramanya,
Willie Neiswanger,
Qirong Ho,
Hao Zhang,
Gregory R. Ganger,
Eric P. Xing
Abstract:
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how D…
▽ More
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources.
Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how their goodput (a novel metric we introduce that combines system throughput with statistical efficiency) would change by adding or removing resources. Leveraging these information, Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually optimizing each DL job to better utilize those resources.
In experiments with real DL jobs and with trace-driven simulations, Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers, even when they are provided with ideal resource and training configurations for every job. Pollux promotes fairness among DL jobs competing for resources based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments. Pollux is implemented and publicly available as part of an open-source project at https://github.com/petuum/adaptdl.
△ Less
Submitted 26 May, 2021; v1 submitted 27 August, 2020;
originally announced August 2020.
-
Fault Tolerance in Iterative-Convergent Machine Learning
Authors:
Aurick Qiao,
Bryon Aragam,
Bingjing Zhang,
Eric P. Xing
Abstract:
Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems are…
▽ More
Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems are only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific types of training algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms and use this framework to design new strategies for checkpoint-based fault tolerance. Our framework yields a worst-case upper bound on the iteration cost of arbitrary perturbations to model parameters during training. Our system, SCAR, employs strategies which reduce the iteration cost upper bound due to perturbations incurred when recovering from checkpoints. We show that SCAR can reduce the iteration cost of partial failures by 78% - 95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms.
△ Less
Submitted 16 October, 2018;
originally announced October 2018.