-
Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability
Authors:
Douglas Jiang,
Zilin Dai,
Luxuan Zhang,
Qiyi Yu,
Haoqi Sun,
Feng Tian
Abstract:
Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, re…
▽ More
Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
Authors:
Weiyu Li,
Xuanyang Zhang,
Zheng Sun,
Di Qi,
Hao Li,
Wei Cheng,
Weiwei Cai,
Shihao Wu,
Jiarui Liu,
Zihao Wang,
Xiao Chen,
Feipeng Tian,
Jianxiong Pan,
Zeming Li,
Gang Yu,
Xiangyu Zhang,
Daxin Jiang,
Ping Tan
Abstract:
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline…
▽ More
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Real-Time Bit-Level Encryption of Full High-Definition Video Without Diffusion
Authors:
Dong Jiang,
Hui-ran Luo,
Zi-jian Cui,
Xi-jue Zhao,
Lin-sheng Huang,
Liang-liang Lu
Abstract:
Despite the widespread adoption of Shannon's confusion-diffusion architecture in image encryption, the implementation of diffusion to sequentially establish inter-pixel dependencies for attaining plaintext sensitivity constrains algorithmic parallelism, while the execution of multiple rounds of diffusion operations to meet the required sensitivity metrics incurs excessive computational overhead. C…
▽ More
Despite the widespread adoption of Shannon's confusion-diffusion architecture in image encryption, the implementation of diffusion to sequentially establish inter-pixel dependencies for attaining plaintext sensitivity constrains algorithmic parallelism, while the execution of multiple rounds of diffusion operations to meet the required sensitivity metrics incurs excessive computational overhead. Consequently, the pursuit of plaintext sensitivity through diffusion operations is the primary factor limiting the computational efficiency and throughput of video encryption algorithms, rendering them inadequate to meet the demands of real-time encryption for high-resolution video. To address the performance limitation, this paper proposes a real-time video encryption protocol based on heterogeneous parallel computing, which incorporates the SHA-256 hashes of original frames as input, employs multiple CPU threads to concurrently generate encryption-related data, and deploys numerous GPU threads to simultaneously encrypt pixels. By leveraging the extreme input sensitivity of the SHA hash, the proposed protocol achieves the required plaintext sensitivity metrics with only a single round of confusion and XOR operations, significantly reducing computational overhead. Furthermore, through eliminating the reliance on diffusion, it realizes the allocation of a dedicated GPU thread for encrypting each pixel within every channel, effectively enhancing algorithm's parallelism. The experimental results demonstrate that our approach not only exhibits superior statistical properties and robust security but also achieving delay-free bit-level encryption for 1920$\times$1080 resolution (full high definition) video at 30 FPS, with an average encryption time of 25.84 ms on a server equipped with an Intel Xeon Gold 6226R CPU and an NVIDIA GeForce RTX 3090 GPU.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs
Authors:
Yubo Shu,
Zhewei Huang,
Xin Wu,
Chen Hu,
Shuchang Zhou,
Daxin Jiang
Abstract:
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasonin…
▽ More
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasoning, which often limits reasoning diversity and coherency, frequently recycling fixed strategies or exhibiting unnecessary shifts in attention. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach. We first introduce the Compound-QA task, which concatenates multiple problems into a single prompt to assess both diversity and coherency of reasoning. Our analysis shows that Compound-QA exposes weaknesses in monologue reasoning, evidenced by both quantitative metrics and qualitative reasoning traces. Building on the analysis, we propose a dialogue-based reasoning, named DialogueReason, structured around agents, environment, and interactions. Using PPO with rule-based rewards, we train open-source LLMs (Qwen-QWQ and Qwen-Base) to adopt dialogue reasoning. We evaluate trained models on MATH, AIME, and GPQA datasets, showing that the dialogue reasoning model outperforms monologue models under more complex compound questions. Additionally, we discuss how dialogue-based reasoning helps enhance interpretability, facilitate more intuitive human interaction, and inspire advances in multi-agent system design.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Authors:
Mengze Hong,
Wailing Ng,
Di Jiang,
Chen Jason Zhang
Abstract:
The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first mu…
▽ More
The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
Authors:
Dengyang Jiang,
Mengmeng Wang,
Liuzhuozheng Li,
Lei Zhang,
Haoyu Wang,
Wei Wei,
Guang Dai,
Yanning Zhang,
Jingdong Wang
Abstract:
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation…
▽ More
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance through a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in the earlier layer with higher noise to that in the later layer with lower noise to progressively enhance the overall representation learning during only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that are heavily dependent on powerful external representation priors.
△ Less
Submitted 13 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Carbon Aware Transformers Through Joint Model-Hardware Optimization
Authors:
Irene Wang,
Newsha Ardalani,
Mostafa Elhoushi,
Daniel Jiang,
Samuel Hsia,
Ekin Sumbul,
Divya Mahajan,
Carole-Jean Wu,
Bilge Acun
Abstract:
The rapid growth of machine learning (ML) systems necessitates a more comprehensive evaluation of their environmental impact, particularly their carbon footprint, which comprises operational carbon from training and inference execution and embodied carbon from hardware manufacturing and its entire life-cycle. Despite the increasing importance of embodied emissions, there is a lack of tools and fra…
▽ More
The rapid growth of machine learning (ML) systems necessitates a more comprehensive evaluation of their environmental impact, particularly their carbon footprint, which comprises operational carbon from training and inference execution and embodied carbon from hardware manufacturing and its entire life-cycle. Despite the increasing importance of embodied emissions, there is a lack of tools and frameworks to holistically quantify and optimize the total carbon footprint of ML systems. To address this, we propose CATransformers, a carbon-aware architecture search framework that enables sustainability-driven co-optimization of ML models and hardware architectures. By incorporating both operational and embodied carbon metrics into early design space exploration of domain-specific hardware accelerators, CATransformers demonstrates that optimizing for carbon yields design choices distinct from those optimized solely for latency or energy efficiency. We apply our framework to multi-modal CLIP-based models, producing CarbonCLIP, a family of CLIP models achieving up to 17% reduction in total carbon emissions while maintaining accuracy and latency compared to state-of-the-art edge small CLIP baselines. This work underscores the need for holistic optimization methods to design high-performance, environmentally sustainable AI systems.
△ Less
Submitted 8 May, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Authors:
Dongzhi Jiang,
Ziyu Guo,
Renrui Zhang,
Zhuofan Zong,
Hao Li,
Le Zhuo,
Shilin Yan,
Pheng-Ann Heng,
Hongsheng Li
Abstract:
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Spe…
▽ More
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Step1X-Edit: A Practical Framework for General Image Editing
Authors:
Shiyu Liu,
Yucheng Han,
Peng Xing,
Fukun Yin,
Rui Wang,
Wei Cheng,
Jiaqi Liao,
Yingming Wang,
Honghao Fu,
Chunrui Han,
Guopeng Li,
Yuang Peng,
Quan Sun,
Jingwei Wu,
Yan Cai,
Zheng Ge,
Ranchen Ming,
Lei Xia,
Xianfang Zeng,
Yibo Zhu,
Binxing Jiao,
Xiangyu Zhang,
Gang Yu,
Daxin Jiang
Abstract:
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of…
▽ More
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
△ Less
Submitted 6 May, 2025; v1 submitted 24 April, 2025;
originally announced April 2025.
-
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
Authors:
Yinmin Zhong,
Zili Zhang,
Xiaoniu Song,
Hanpeng Hu,
Chao Jin,
Bingyang Wu,
Nuo Chen,
Yukun Chen,
Yu Zhou,
Changyi Wan,
Hongyu Zhou,
Yimin Jiang,
Yibo Zhu,
Daxin Jiang
Abstract:
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the dis…
▽ More
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment.
StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
AffordanceSAM: Segment Anything Once More in Affordance Grounding
Authors:
Dengyang Jiang,
Mengmeng Wang,
Teli Ma,
Hengzhuang Li,
Yong liu,
Guang Dai,
Lei Zhang
Abstract:
Improving the generalization ability of an affordance grounding model to recognize regions for unseen objects and affordance functions is crucial for real-world application. However, current models are still far away from such standards. To address this problem, we introduce AffordanceSAM, an effective approach that extends SAM's generalization capacity to the domain of affordance grounding. For t…
▽ More
Improving the generalization ability of an affordance grounding model to recognize regions for unseen objects and affordance functions is crucial for real-world application. However, current models are still far away from such standards. To address this problem, we introduce AffordanceSAM, an effective approach that extends SAM's generalization capacity to the domain of affordance grounding. For the purpose of thoroughly transferring SAM's robust performance in segmentation to affordance, we initially propose an affordance-adaption module in order to help modify SAM's segmentation output to be adapted to the specific functional regions required for affordance grounding. We concurrently make a coarse-to-fine training recipe to make SAM first be aware of affordance objects and actions coarsely, and then be able to generate affordance heatmaps finely. Both quantitative and qualitative experiments show the strong generalization capacity of our AffordanceSAM, which not only surpasses previous methods under AGD20K benchmark but also shows evidence to handle the task with novel objects and affordance functions.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Towards Understanding Camera Motions in Any Video
Authors:
Zhiqiu Lin,
Siyuan Cen,
Daniel Jiang,
Jay Karhade,
Hewei Wang,
Chancharik Mitra,
Tiffany Ling,
Yuhan Huang,
Sifan Liu,
Mingyu Chen,
Rushikesh Zawar,
Xue Bai,
Yilun Du,
Chuang Gan,
Deva Ramanan
Abstract:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that s…
▽ More
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline
Authors:
Zhenliang Xue,
Hanpeng Hu,
Xing Chen,
Yimin Jiang,
Yixin Song,
Zeyu Mi,
Yibo Zhu,
Daxin Jiang,
Yubin Xia,
Haibo Chen
Abstract:
Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multim…
▽ More
Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data.
In this paper, we present PipeWeaver, a dynamic pipeline scheduling framework designed for LMM training. The core of PipeWeaver is dynamic interleaved pipeline, which searches for pipeline schedules dynamically tailored to current training batches. PipeWeaver addresses issues of LMM training with two techniques: adaptive modality-aware partitioning and efficient pipeline schedule search within a hierarchical schedule space. Meanwhile, PipeWeaver utilizes SEMU (Step Emulator), a training simulator for multimodal models, for accurate performance estimations, accelerated by spatial-temporal subgraph reuse to improve search efficiency. Experiments show that PipeWeaver can enhance LMM training efficiency by up to 97.3% compared to state-of-the-art systems, and demonstrate excellent adaptivity to LMM training's data dynamicity.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Harmony: A Unified Framework for Modality Incremental Learning
Authors:
Yaguang Song,
Xiaoshan Yang,
Dongmei Jiang,
Yaowei Wang,
Changsheng Xu
Abstract:
Incremental learning aims to enable models to continuously acquire knowledge from evolving data streams while preserving previously learned capabilities. While current research predominantly focuses on unimodal incremental learning and multimodal incremental learning where the modalities are consistent, real-world scenarios often present data from entirely new modalities, posing additional challen…
▽ More
Incremental learning aims to enable models to continuously acquire knowledge from evolving data streams while preserving previously learned capabilities. While current research predominantly focuses on unimodal incremental learning and multimodal incremental learning where the modalities are consistent, real-world scenarios often present data from entirely new modalities, posing additional challenges. This paper investigates the feasibility of developing a unified model capable of incremental learning across continuously evolving modal sequences. To this end, we introduce a novel paradigm called Modality Incremental Learning (MIL), where each learning stage involves data from distinct modalities. To address this task, we propose a novel framework named Harmony, designed to achieve modal alignment and knowledge retention, enabling the model to reduce the modal discrepancy and learn from a sequence of distinct modalities, ultimately completing tasks across multiple modalities within a unified framework. Our approach introduces the adaptive compatible feature modulation and cumulative modal bridging. Through constructing historical modal features and performing modal knowledge accumulation and alignment, the proposed components collaboratively bridge modal differences and maintain knowledge retention, even with solely unimodal data available at each learning phase.These components work in concert to establish effective modality connections and maintain knowledge retention, even when only unimodal data is available at each learning stage. Extensive experiments on the MIL task demonstrate that our proposed method significantly outperforms existing incremental learning methods, validating its effectiveness in MIL scenarios.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
Authors:
Yushuai Sun,
Zikun Zhou,
Dongmei Jiang,
Yaowei Wang,
Jun Yu,
Guangming Lu,
Wenjie Pei
Abstract:
Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited…
▽ More
Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. Our code and model are available at https://github.com/Bunny-Black/PrunNet.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
ADT: Tuning Diffusion Models with Adversarial Supervision
Authors:
Dazhong Shen,
Guanglu Song,
Yi Zhang,
Bingqi Ma,
Lujundong Li,
Dongzhi Jiang,
Zhuofan Zong,
Yu Liu
Abstract:
Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergences hinder the alignment between infere…
▽ More
Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergences hinder the alignment between inference and training data distributions, due to potential prediction biases and cumulative error accumulation. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by stimulating the inference process during optimization and aligning the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3), demonstrate that ADT significantly improves both distribution alignment and image quality.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
Authors:
En Yu,
Kangheng Lin,
Liang Zhao,
Jisheng Yin,
Yana Wei,
Yuang Peng,
Haoran Wei,
Jianjian Sun,
Chunrui Han,
Zheng Ge,
Xiangyu Zhang,
Daxin Jiang,
Jingyu Wang,
Wenbing Tao
Abstract:
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in th…
▽ More
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Authors:
Ziyi Wang,
Haoran Wu,
Yiming Rong,
Deyang Jiang,
Yixin Zhang,
Yunlong Zhao,
Shuang Xu,
Bo XU
Abstract:
Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are lim…
▽ More
Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Multi-Perspective Attention Mechanism for Bias-Aware Sequential Recommendation
Authors:
Mingjian Fu,
Hengsheng Chen,
Dongchun Jiang,
Yanchao Tan
Abstract:
In the era of advancing information technology, recommender systems have emerged as crucial tools for dealing with information overload. However, traditional recommender systems still have limitations in capturing the dynamic evolution of user behavior. To better understand and predict user behavior, especially taking into account the complexity of temporal evolution, sequential recommender system…
▽ More
In the era of advancing information technology, recommender systems have emerged as crucial tools for dealing with information overload. However, traditional recommender systems still have limitations in capturing the dynamic evolution of user behavior. To better understand and predict user behavior, especially taking into account the complexity of temporal evolution, sequential recommender systems have gradually become the focus of research. Currently, many sequential recommendation algorithms ignore the amplification effects of prevalent biases, which leads to recommendation results being susceptible to the Matthew Effect. Additionally, it will impose limitations on the recommender system's ability to deeply perceive and capture the dynamic shifts in user preferences, thereby diminishing the extent of its recommendation reach. To address this issue effectively, we propose a recommendation system based on sequential information and attention mechanism called Multi-Perspective Attention Bias Sequential Recommendation (MABSRec). Firstly, we reconstruct user sequences into three short types and utilize graph neural networks for item weighting. Subsequently, an adaptive multi-bias perspective attention module is proposed to enhance the accuracy of recommendations. Experimental results show that the MABSRec model exhibits significant advantages in all evaluation metrics, demonstrating its excellent performance in the sequence recommendation task.
△ Less
Submitted 26 February, 2025;
originally announced April 2025.
-
SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Authors:
Junjie Jiang,
Zelin Wang,
Manqi Zhao,
Yin Li,
DongSheng Jiang
Abstract:
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zer…
▽ More
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
△ Less
Submitted 5 May, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Authors:
Jingcheng Hu,
Yinmin Zhang,
Qi Han,
Daxin Jiang,
Xiangyu Zhang,
Heung-Yeung Shum
Abstract:
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($λ=1$, $γ=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length a…
▽ More
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($λ=1$, $γ=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency -- requiring only a tenth of the training steps, compared to DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
Authors:
Haolong Yan,
Kaijun Tan,
Yeqing Shen,
Xin Huang,
Zheng Ge,
Xiangyu Zhang,
Si Li,
Daxin Jiang
Abstract:
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel a…
▽ More
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life
Authors:
Jingyuan Xue,
Longfei Wei,
Dongjing Jiang,
Fang Sheng,
Russell Greiner,
Jianfei Zhang
Abstract:
Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patte…
▽ More
Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patterns and uncertainty quantification. To address these challenges, we propose a hybrid survival analysis framework integrating survival data reconstruction, survival model learning, and survival probability estimation. Our approach transforms battery voltage time series into time-to-failure data using path signatures. The multiple Cox-based survival models and machine-learning-based methods, such as DeepHit and MTLR, are learned to predict battery failure-free probabilities over time. Experiments conducted on the Toyota battery and NASA battery datasets demonstrate the effectiveness of our approach, achieving high time-dependent AUC and concordance index (C-Index) while maintaining a low integrated Brier score. The data and source codes for this work are available to the public at https://github.com/thinkxca/rul.
△ Less
Submitted 6 May, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Authors:
Songjun Tu,
Jiahao Lin,
Xiangyu Tian,
Qichao Zhang,
Linjing Li,
Yuqian Fu,
Nan Xu,
Wei He,
Xiangyuan Lan,
Dongmei Jiang,
Dongbin Zhao
Abstract:
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effect…
▽ More
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
△ Less
Submitted 27 March, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Nonlinear Principal Component Analysis with Random Bernoulli Features for Process Monitoring
Authors:
Ke Chen,
Dandan Jiang
Abstract:
The process generates substantial amounts of data with highly complex structures, leading to the development of numerous nonlinear statistical methods. However, most of these methods rely on computations involving large-scale dense kernel matrices. This dependence poses significant challenges in meeting the high computational demands and real-time responsiveness required by online monitoring syste…
▽ More
The process generates substantial amounts of data with highly complex structures, leading to the development of numerous nonlinear statistical methods. However, most of these methods rely on computations involving large-scale dense kernel matrices. This dependence poses significant challenges in meeting the high computational demands and real-time responsiveness required by online monitoring systems. To alleviate the computational burden of dense large-scale matrix multiplication, we incorporate the bootstrap sampling concept into random feature mapping and propose a novel random Bernoulli principal component analysis method to efficiently capture nonlinear patterns in the process. We derive a convergence bound for the kernel matrix approximation constructed using random Bernoulli features, ensuring theoretical robustness. Subsequently, we design four fast process monitoring methods based on random Bernoulli principal component analysis to extend its nonlinear capabilities for handling diverse fault scenarios. Finally, numerical experiments and real-world data analyses are conducted to evaluate the performance of the proposed methods. Results demonstrate that the proposed methods offer excellent scalability and reduced computational complexity, achieving substantial cost savings with minimal performance loss compared to traditional kernel-based approaches.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
3D Gaussian Splatting against Moving Objects for High-Fidelity Street Scene Reconstruction
Authors:
Peizhen Zheng,
Longfei Wei,
Dongjing Jiang,
Jianfei Zhang
Abstract:
The accurate reconstruction of dynamic street scenes is critical for applications in autonomous driving, augmented reality, and virtual reality. Traditional methods relying on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time processing constraints, limiting their effectiveness in complex urban environments. While multi-view stereo and neural radiance…
▽ More
The accurate reconstruction of dynamic street scenes is critical for applications in autonomous driving, augmented reality, and virtual reality. Traditional methods relying on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time processing constraints, limiting their effectiveness in complex urban environments. While multi-view stereo and neural radiance fields have advanced 3D reconstruction, they face challenges in computational efficiency and handling scene dynamics. This paper proposes a novel 3D Gaussian point distribution method for dynamic street scene reconstruction. Our approach introduces an adaptive transparency mechanism that eliminates moving objects while preserving high-fidelity static scene details. Additionally, iterative refinement of Gaussian point distribution enhances geometric accuracy and texture representation. We integrate directional encoding with spatial position optimization to optimize storage and rendering efficiency, reducing redundancy while maintaining scene integrity. Experimental results demonstrate that our method achieves high reconstruction quality, improved rendering performance, and adaptability in large-scale dynamic environments. These contributions establish a robust framework for real-time, high-precision 3D reconstruction, advancing the practicality of dynamic scene modeling across multiple applications.
△ Less
Submitted 3 April, 2025; v1 submitted 15 March, 2025;
originally announced March 2025.
-
CCRSat: A Collaborative Computation Reuse Framework for Satellite Edge Computing Networks
Authors:
Ye Zhang,
Zhishu Shen,
Dawen Jiang,
Xiangrui Liu,
Qiushi Zheng,
Jiong Jin
Abstract:
In satellite computing applications, such as remote sensing, tasks often involve similar or identical input data, leading to the same processing results. Computation reuse is an emerging paradigm that leverages the execution results of previous tasks to enhance the utilization of computational resources. While this paradigm has been extensively studied in terrestrial networks with abundant computi…
▽ More
In satellite computing applications, such as remote sensing, tasks often involve similar or identical input data, leading to the same processing results. Computation reuse is an emerging paradigm that leverages the execution results of previous tasks to enhance the utilization of computational resources. While this paradigm has been extensively studied in terrestrial networks with abundant computing and caching resources, such as named data networking (NDN), it is essential to develop a framework appropriate for resource-constrained satellite networks, which are expected to have longer task completion times. In this paper, we propose CCRSat, a collaborative computation reuse framework for satellite edge computing networks. CCRSat initially implements local computation reuse on an independent satellite, utilizing a satellite reuse state (SRS) to assess the efficiency of computation reuse. Additionally, an inter-satellite computation reuse algorithm is introduced, which utilizes the collaborative sharing of similarity in previously processed data among multiple satellites. The evaluation results tested on real-world datasets demonstrate that, compared to comparative scenarios, our proposed CCRSat can significantly reduce task completion time by up to 62.1% and computational resource consumption by up to 28.8%.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Authors:
Haoyang Huang,
Guoqing Ma,
Nan Duan,
Xing Chen,
Changyi Wan,
Ranchen Ming,
Tianyu Wang,
Bo Wang,
Zhiying Lu,
Aojie Li,
Xianfang Zeng,
Xinhao Zhang,
Gang Yu,
Yuhe Yin,
Qiling Wu,
Wen Sun,
Kang An,
Xin Han,
Deshan Sun,
Wei Ji,
Bizhu Huang,
Brian Li,
Chenfei Wu,
Guanzhe Huang,
Huixin Xiong
, et al. (29 additional authors not shown)
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results de…
▽ More
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
Authors:
Ziyu Guo,
Ray Zhang,
Hao Chen,
Jialin Gao,
Dongzhi Jiang,
Jiaze Wang,
Pheng-Ann Heng
Abstract:
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scient…
▽ More
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models
Authors:
Zilu Guo,
Hongbin Lin,
Zhihao Yuan,
Chaoda Zheng,
Pengshuo Qiu,
Dongzhi Jiang,
Renrui Zhang,
Chun-Mei Feng,
Zhen Li
Abstract:
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmen…
▽ More
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
ProReflow: Progressive Reflow with Decomposed Velocity
Authors:
Lei Ke,
Haohang Xu,
Xuefei Ning,
Yu Li,
Jiajun Li,
Haoling Li,
Yuxuan Lin,
Dongsheng Jiang,
Yujiu Yang,
Linfeng Zhang
Abstract:
Diffusion models have achieved significant progress in both image and video generation while still suffering from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process of diffusion models into a straight line for a few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not opti…
▽ More
Diffusion models have achieved significant progress in both image and video generation while still suffering from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process of diffusion models into a straight line for a few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not optimal and introduce two techniques to improve it. Firstly, we introduce progressive reflow, which progressively reflows the diffusion models in local timesteps until the whole diffusion progresses, reducing the difficulty of flow matching. Second, we introduce aligned v-prediction, which highlights the importance of direction matching in flow matching over magnitude matching. Experimental results on SDv1.5 and SDXL demonstrate the effectiveness of our method, for example, conducting on SDv1.5 achieves an FID of 10.70 on MSCOCO2014 validation set with only 4 sampling steps, close to our teacher model (32 DDIM steps, FID = 10.05).
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
Authors:
Houyi Li,
Wenzhen Zheng,
Jingcheng Hu,
Qiufeng Wang,
Hanshan Zhang,
Zili Wang,
Shijie Xuyang,
Yuantao Fan,
Shuigeng Zhou,
Xiangyu Zhang,
Daxin Jiang
Abstract:
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationshi…
▽ More
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed models and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.09% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To our best known, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/
△ Less
Submitted 19 March, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
CREATE-FFPE: Cross-Resolution Compensated and Multi-Frequency Enhanced FS-to-FFPE Stain Transfer for Intraoperative IHC Images
Authors:
Yiyang Lin,
Danling Jiang,
Xinyu Liu,
Yun Miao,
Yixuan Yuan
Abstract:
In the immunohistochemical (IHC) analysis during surgery, frozen-section (FS) images are used to determine the benignity or malignancy of the tumor. However, FS image faces problems such as image contamination and poor nuclear detail, which may disturb the pathologist's diagnosis. In contrast, formalin-fixed and paraffin-embedded (FFPE) image has a higher staining quality, but it requires quite a…
▽ More
In the immunohistochemical (IHC) analysis during surgery, frozen-section (FS) images are used to determine the benignity or malignancy of the tumor. However, FS image faces problems such as image contamination and poor nuclear detail, which may disturb the pathologist's diagnosis. In contrast, formalin-fixed and paraffin-embedded (FFPE) image has a higher staining quality, but it requires quite a long time to prepare and thus is not feasible during surgery. To help pathologists observe IHC images with high quality in surgery, this paper proposes a Cross-REsolution compensATed and multi-frequency Enhanced FS-to-FFPE (CREATE-FFPE) stain transfer framework, which is the first FS-to-FFPE method for the intraoperative IHC images. To solve the slide contamination and poor nuclear detail mentioned above, we propose the cross-resolution compensation module (CRCM) and the wavelet detail guidance module (WDGM). Specifically, CRCM compensates for information loss due to contamination by providing more tissue information across multiple resolutions, while WDGM produces the desirable details in a wavelet way, and the details can be used to guide the stain transfer to be more precise. Experiments show our method can beat all the competing methods on our dataset. In addition, the FID has decreased by 44.4%, and KID*100 has decreased by 71.2% by adding the proposed CRCM and WDGM in ablation studies, and the performance of a downstream microsatellite instability prediction task with public dataset can be greatly improved by performing our FS-to-FFPE stain transfer.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
Authors:
Zaijing Li,
Yuquan Xie,
Rui Shao,
Gongwei Chen,
Dongmei Jiang,
Liqiang Nie
Abstract:
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Lan…
▽ More
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA)} dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft. Please see the project page at https://cybertronagent.github.io/Optimus-2.github.io/.
△ Less
Submitted 11 March, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Cardiac Evidence Backtracking for Eating Behavior Monitoring using Collocative Electrocardiogram Imagining
Authors:
Xu-Lu Zhang,
Zhen-Qun Yang,
Dong-Mei Jiang,
Ga Liao,
Qing Li,
Ramesh Jain,
Xiao-Yong Wei
Abstract:
Eating monitoring has remained an open challenge in medical research for years due to the lack of non-invasive sensors for continuous monitoring and the reliable methods for automatic behavior detection. In this paper, we present a pilot study using the wearable 24-hour ECG for sensing and tailoring the sophisticated deep learning for ad-hoc and interpretable detection. This is accomplished using…
▽ More
Eating monitoring has remained an open challenge in medical research for years due to the lack of non-invasive sensors for continuous monitoring and the reliable methods for automatic behavior detection. In this paper, we present a pilot study using the wearable 24-hour ECG for sensing and tailoring the sophisticated deep learning for ad-hoc and interpretable detection. This is accomplished using a collocative learning framework in which 1) we construct collocative tensors as pseudo-images from 1D ECG signals to improve the feasibility of 2D image-based deep models; 2) we formulate the cardiac logic of analyzing the ECG data in a comparative way as periodic attention regulators so as to guide the deep inference to collect evidence in a human comprehensible manner; and 3) we improve the interpretability of the framework by enabling the backtracking of evidence with a set of methods designed for Class Activation Mapping (CAM) decoding and decision tree/forest generation. The effectiveness of the proposed framework has been validated on the largest ECG dataset of eating behavior with superior performance over conventional models, and its capacity of cardiac evidence mining has also been verified through the consistency of the evidence it backtracked and that of the previous medical studies.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Aligned Multi Objective Optimization
Authors:
Yonathan Efroni,
Ben Kretzu,
Daniel Jiang,
Jalaj Bhandari,
Zheqing,
Zhu,
Karen Ullrich
Abstract:
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLMs training show that diverse related tasks can enhance perf…
▽ More
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLMs training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, such phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.
△ Less
Submitted 3 March, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…
▽ More
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
△ Less
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Authors:
Guoqing Ma,
Haoyang Huang,
Kun Yan,
Liangyu Chen,
Nan Duan,
Shengming Yin,
Changyi Wan,
Ranchen Ming,
Xiaoniu Song,
Xing Chen,
Yu Zhou,
Deshan Sun,
Deyu Zhou,
Jian Zhou,
Kaijun Tan,
Kang An,
Mei Chen,
Wei Ji,
Qiling Wu,
Wen Sun,
Xin Han,
Yanan Wei,
Zheng Ge,
Aojie Li,
Bin Wang
, et al. (90 additional authors not shown)
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded…
▽ More
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
△ Less
Submitted 24 February, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Authors:
Dongzhi Jiang,
Renrui Zhang,
Ziyu Guo,
Yanwei Li,
Yu Qi,
Xinyan Chen,
Liuhui Wang,
Jianhan Jin,
Claire Guo,
Shen Yan,
Bo Zhang,
Chaoyou Fu,
Peng Gao,
Hongsheng Li
Abstract:
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR,…
▽ More
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training
Authors:
Xin Tan,
Yuetao Chen,
Yimin Jiang,
Xing Chen,
Kun Yan,
Nan Duan,
Yibo Zhu,
Daxin Jiang,
Hong Xu
Abstract:
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially with high-definition, lengthy videos, where it can consume up to 95% of processing time and demand specialized context parallelism.
This paper introduces DSV to accelerate video DiT train…
▽ More
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially with high-definition, lengthy videos, where it can consume up to 95% of processing time and demand specialized context parallelism.
This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe. DSV uses a two-stage algorithm to capture the dynamic sparsity patterns via low-rank based approximation of the original query and key. It employs custom kernels to efficiently identify critical key-value pairs and compute the sparse attention. To accommodate the new sparsity dimension, DSV adopts a hybrid sparsity-aware context parallelism that re-balances the skewed workload across attention heads and blocks due to sparsity heterogeneity. DSV achieves up to 3.02x higher training throughput, scaling to 128 GPUs and 520k token lengths, without quality loss.
△ Less
Submitted 16 March, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
Authors:
Chenchen Shou,
Guyue Liu,
Hao Nie,
Huaiyu Meng,
Yu Zhou,
Yimin Jiang,
Wenqing Lv,
Yelong Xu,
Yuanwei Lu,
Zhang Chen,
Yanbo Yu,
Yichen Shen,
Yibo Zhu,
Daxin Jiang
Abstract:
Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scalin…
▽ More
Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 takes a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs).
We propose InfiniteHBD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfiniteHBD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node, and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfiniteHBD achieves 31% of the cost of NVL-72, near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per Node).
△ Less
Submitted 7 May, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Authors:
Huaye Zeng,
Dongfu Jiang,
Haozhe Wang,
Ping Nie,
Xiaotong Chen,
Wenhu Chen
Abstract:
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipe…
▽ More
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of 10-point improvement for Llama-3.1-8B-Ins and 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over 25\% and MBPP-plus by 6\% for merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
△ Less
Submitted 10 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
PolaFormer: Polarity-aware Linear Attention for Vision Transformers
Authors:
Weikang Meng,
Yadan Luo,
Xin Li,
Dongmei Jiang,
Zheng Zhang
Abstract:
Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less…
▽ More
Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with positive first and second derivatives) that can reduce entropy in the attention distribution. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated. Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%.
△ Less
Submitted 4 March, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize
Authors:
Haohang Xu,
Longyu Chen,
Shuangrui Ding,
Yilin Gao,
Dongsheng Jiang,
Yin Li,
Shugong Xu,
Junqing Yu,
Wei Yang
Abstract:
Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decomp…
▽ More
Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decompositions; for instance, Fourier analysis decomposes signals by frequency, while wavelet analysis captures localized frequency components, reflecting both spatial and frequency information simultaneously. Inspired by these principles, we propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output. The diffusion model target, whether raw RGB pixels or latent features from a Variational Autoencoder, s divided into multiple components that each capture distinct spatial levels. The low-resolution component contains the primary informative signal, while higher-resolution components add high-frequency details, such as texture. This approach divides image generation into two stages: producing a low-resolution base signal, followed by a high-resolution residual signal. Both stages can be effectively modeled using simpler, lightweight transformer architectures compared to full-resolution generation. This decomposition is conceptually similar to wavelet decomposition but offers a more streamlined and intuitive design. Our method, termed MSF(short for Multi-Scale Factorization), achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
Authors:
Zheng Chong,
Wenqing Zhang,
Shiyue Zhang,
Jun Zheng,
Xiao Dong,
Haoxiang Li,
Yiling Wu,
Dongmei Jiang,
Xiaodan Liang
Abstract:
Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-o…
▽ More
Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Enhancing Generalization in Chain of Thought Reasoning for Smaller Models
Authors:
Maxwell J. Yin,
Dingyi Jiang,
Yongbing Chen,
Boyu Wang,
Charles Ling
Abstract:
Chain-of-Thought (CoT) reasoning in smaller language models is a challenging natural language process problem yet highly desirable in many real-life applications. Existing CoT knowledge distillation methods often suffer from overly conservative memorization in smaller LLMs, leading to low generalization confidence. As fully preserving the CoT ability of teacher model is impossible, we hypothesize…
▽ More
Chain-of-Thought (CoT) reasoning in smaller language models is a challenging natural language process problem yet highly desirable in many real-life applications. Existing CoT knowledge distillation methods often suffer from overly conservative memorization in smaller LLMs, leading to low generalization confidence. As fully preserving the CoT ability of teacher model is impossible, we hypothesize that adversarial CoT fine-tuning is crucial for developing smaller LLM with robust CoT generalization. To this end, we propose \textit{PRompt-Assisted Domain-Adversarial fine-tuning} (PRADA), a principled fine-tuning framework that integrates diverse CoT domains. Specifically, PRADA pioneers two CoT improvements in smaller LLM: (1) Recovering the domain-invariant feature insight which typically lost during distillation with domain adversarial fine-tuning; (2) Enhancing the domain adaptability of CoT prompt engineering by employing domain-adversarial approaches. We theoretically demonstrate the effectiveness of our approach and empirically show that it significantly outperforms the state of the arts in a wide range of tasks. Moreover, our empirical findings reveal that the smaller LLM, when leveraging PRADA, aligns closely with domain knowledge, thereby improving the explainability of our approach.
△ Less
Submitted 16 January, 2025;
originally announced January 2025.
-
Slow Perception: Let's Perceive Geometric Figures Step-by-step
Authors:
Haoran Wei,
Youyang Yin,
Yumeng Li,
Jia Wang,
Liang Zhao,
Jianjian Sun,
Zheng Ge,
Xiangyu Zhang,
Daxin Jiang
Abstract:
Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shap…
▽ More
Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations, as our humans, reconstruct complex geometric structures progressively. There are two-fold stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference time scaling law -- the slower, the better. Researchers strive to speed up the model's perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.
△ Less
Submitted 26 January, 2025; v1 submitted 29 December, 2024;
originally announced December 2024.
-
Sychronous vs. asynchronous coalitions in multiplayer games, with applications to guts poker
Authors:
Jessica Babyak,
Kevin Buck,
Leah Dichter,
David Jiang,
Kevin Zumbrun
Abstract:
We study the issue introduced by Buck-Lee-Platnick-Wheeler-Zumbrun of synchronous vs. asynchronous coalitions in multiplayer games, that is, the difference between coalitions with full and partial communication, with a specific interest in the context of continuous Guts poker where this problem was originally formulated. We observe for general symmetric multiplayer games, with players 2-n in coali…
▽ More
We study the issue introduced by Buck-Lee-Platnick-Wheeler-Zumbrun of synchronous vs. asynchronous coalitions in multiplayer games, that is, the difference between coalitions with full and partial communication, with a specific interest in the context of continuous Guts poker where this problem was originally formulated. We observe for general symmetric multiplayer games, with players 2-n in coalition against player 1, that there are three values, corresponding to symmetric Nash equilibrium, optimal asynchronous, and optimal synchronous strategies, in that order, for which inequalities may for different examples be strict or nonstrict (i.e., equality) in any combination. Different from Nash equilibria and synchronous optima, which may be phrased as convex optimization problems, or classical 2-player games, determination of asynchronous optima is a nonconvex optimization problem. We discuss methods of numerical approximation of this optimum, and examine performance on 3-player rock-paper-scissors and discretized Guts poker. Finally, we present sufficient conditions guaranteeing different possibilities for behavior, based on concave/convexity properties of the payoff function. These answer in the affirmative the open problem posed by Buck-Lee-Platnick-Wheeler-Zumbrun whether the optimal asynchronous coalition value for 3-player guts is equal to the Nash equilibrium value zero. At the same time, we present a number of new results regarding synchronous coalition play for continuous $3$-player guts.
△ Less
Submitted 25 December, 2024;
originally announced December 2024.
-
Multi-matrix Factorization Attention
Authors:
Jingcheng Hu,
Houyi Li,
Yinmin Zhang,
Zili Wang,
Shuigeng Zhou,
Xiangyu Zhang,
Heung-Yeung Shum,
Daxin Jiang
Abstract:
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention hea…
▽ More
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA's design enables strong model capacity when working under tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
△ Less
Submitted 14 January, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
Authors:
Mingcong Song,
Xinru Tang,
Fengfan Hou,
Jing Li,
Wei Wei,
Yipeng Ma,
Runqiu Xiao,
Hongjie Si,
Dingcheng Jiang,
Shouyi Yin,
Yang Hu,
Guoping Long
Abstract:
Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especial…
▽ More
Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.