
Showing 1–50 of 366 results for author: Jiang, D

Searching in archive cs.
  1. arXiv:2505.07896  [pdf, ps, other]

    q-bio.GN cs.AI

    Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability

    Authors: Douglas Jiang, Zilin Dai, Luxuan Zhang, Qiyi Yu, Haoqi Sun, Feng Tian

    Abstract: Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, re…

    Submitted 11 May, 2025; originally announced May 2025.

  2. arXiv:2505.07747  [pdf, other]

    cs.CV

    Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

    Authors: Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan

    Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Technical report

  3. arXiv:2505.07158  [pdf, ps, other]

    cs.CR

    Real-Time Bit-Level Encryption of Full High-Definition Video Without Diffusion

    Authors: Dong Jiang, Hui-ran Luo, Zi-jian Cui, Xi-jue Zhao, Lin-sheng Huang, Liang-liang Lu

    Abstract: Despite the widespread adoption of Shannon's confusion-diffusion architecture in image encryption, the implementation of diffusion to sequentially establish inter-pixel dependencies for attaining plaintext sensitivity constrains algorithmic parallelism, while the execution of multiple rounds of diffusion operations to meet the required sensitivity metrics incurs excessive computational overhead. C…

    Submitted 11 May, 2025; originally announced May 2025.

  4. arXiv:2505.07049  [pdf, ps, other]

    cs.AI

    DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs

    Authors: Yubo Shu, Zhewei Huang, Xin Wu, Chen Hu, Shuchang Zhou, Daxin Jiang

    Abstract: We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasonin…

    Submitted 11 May, 2025; originally announced May 2025.

  5. arXiv:2505.05225  [pdf, ps, other]

    cs.CL

    QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

    Authors: Mengze Hong, Wailing Ng, Di Jiang, Chen Jason Zhang

    Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first mu…

    Submitted 8 May, 2025; originally announced May 2025.

  6. arXiv:2505.02831  [pdf, other]

    cs.CV

    No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

    Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang

    Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation…

    Submitted 13 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: Self-Representation Alignment for Diffusion Transformers. Code: https://github.com/vvvvvjdy/SRA

  7. arXiv:2505.01386  [pdf, other]

    cs.LG cs.AR

    Carbon Aware Transformers Through Joint Model-Hardware Optimization

    Authors: Irene Wang, Newsha Ardalani, Mostafa Elhoushi, Daniel Jiang, Samuel Hsia, Ekin Sumbul, Divya Mahajan, Carole-Jean Wu, Bilge Acun

    Abstract: The rapid growth of machine learning (ML) systems necessitates a more comprehensive evaluation of their environmental impact, particularly their carbon footprint, which comprises operational carbon from training and inference execution and embodied carbon from hardware manufacturing and its entire life-cycle. Despite the increasing importance of embodied emissions, there is a lack of tools and fra…

    Submitted 8 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

  8. arXiv:2505.00703  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

    Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li

    Abstract: Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Spe…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: Project Page: https://github.com/CaraJ7/T2I-R1

  9. arXiv:2504.17761  [pdf, other]

    cs.CV

    Step1X-Edit: A Practical Framework for General Image Editing

    Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang

    Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of…

    Submitted 6 May, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

    Comments: code: https://github.com/stepfun-ai/Step1X-Edit

  10. arXiv:2504.15930  [pdf, other]

    cs.LG cs.DC

    StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

    Authors: Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang

    Abstract: Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the dis…

    Submitted 22 April, 2025; originally announced April 2025.

  11. arXiv:2504.15650  [pdf, other]

    cs.CV

    AffordanceSAM: Segment Anything Once More in Affordance Grounding

    Authors: Dengyang Jiang, Mengmeng Wang, Teli Ma, Hengzhuang Li, Yong Liu, Guang Dai, Lei Zhang

    Abstract: Improving the generalization ability of an affordance grounding model to recognize regions for unseen objects and affordance functions is crucial for real-world application. However, current models are still far away from such standards. To address this problem, we introduce AffordanceSAM, an effective approach that extends SAM's generalization capacity to the domain of affordance grounding. For t…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: SAM Meets Affordance Grounding

  12. arXiv:2504.15376  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Towards Understanding Camera Motions in Any Video

    Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan

    Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that s…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Project site: https://linzhiqiu.github.io/papers/camerabench/

  13. arXiv:2504.14145  [pdf, other]

    cs.DC cs.AI

    PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline

    Authors: Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen

    Abstract: Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multim…

    Submitted 18 April, 2025; originally announced April 2025.

  14. arXiv:2504.13218  [pdf, other]

    cs.LG cs.AI cs.MM

    Harmony: A Unified Framework for Modality Incremental Learning

    Authors: Yaguang Song, Xiaoshan Yang, Dongmei Jiang, Yaowei Wang, Changsheng Xu

    Abstract: Incremental learning aims to enable models to continuously acquire knowledge from evolving data streams while preserving previously learned capabilities. While current research predominantly focuses on unimodal incremental learning and multimodal incremental learning where the modalities are consistent, real-world scenarios often present data from entirely new modalities, posing additional challen…

    Submitted 17 April, 2025; originally announced April 2025.

  15. arXiv:2504.11879  [pdf, other]

    cs.CV

    Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval

    Authors: Yushuai Sun, Zikun Zhou, Dongmei Jiang, Yaowei Wang, Jun Yu, Guangming Lu, Wenjie Pei

    Abstract: Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: Accepted to CVPR 2025

  16. arXiv:2504.11423  [pdf, other]

    cs.CV cs.AI

    ADT: Tuning Diffusion Models with Adversarial Supervision

    Authors: Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu

    Abstract: Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergences hinder the alignment between infere…

    Submitted 15 April, 2025; originally announced April 2025.

  17. arXiv:2504.07954  [pdf, other]

    cs.CV cs.CL

    Perception-R1: Pioneering Perception Policy with Reinforcement Learning

    Authors: En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao

    Abstract: Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in th…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Github page: https://github.com/linkangheng/PR1

  18. arXiv:2504.06835  [pdf, other]

    cs.CV

    LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

    Authors: Ziyi Wang, Haoran Wu, Yiming Rong, Deyang Jiang, Yixin Zhang, Yunlong Zhao, Shuang Xu, Bo Xu

    Abstract: Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are lim…

    Submitted 9 April, 2025; originally announced April 2025.

  19. arXiv:2504.05323  [pdf, other]

    cs.IR cs.AI cs.CL

    Multi-Perspective Attention Mechanism for Bias-Aware Sequential Recommendation

    Authors: Mingjian Fu, Hengsheng Chen, Dongchun Jiang, Yanchao Tan

    Abstract: In the era of advancing information technology, recommender systems have emerged as crucial tools for dealing with information overload. However, traditional recommender systems still have limitations in capturing the dynamic evolution of user behavior. To better understand and predict user behavior, especially taking into account the complexity of temporal evolution, sequential recommender system…

    Submitted 26 February, 2025; originally announced April 2025.

    Comments: 30 pages, 10 figures, 4 tables

    ACM Class: I.2

  20. arXiv:2504.04519  [pdf, other]

    cs.CV

    SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation

    Authors: Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang

    Abstract: Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zer…

    Submitted 5 May, 2025; v1 submitted 6 April, 2025; originally announced April 2025.

  21. arXiv:2503.24290  [pdf, other]

    cs.LG cs.CL

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Authors: Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

    Abstract: We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($λ=1$, $γ=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length a…

    Submitted 31 March, 2025; originally announced March 2025.

  22. arXiv:2503.21839  [pdf, other]

    cs.CV cs.AI cs.LG

    M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

    Authors: Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang

    Abstract: We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel a…

    Submitted 27 March, 2025; originally announced March 2025.

  23. arXiv:2503.13558  [pdf, other]

    eess.SP cs.AI cs.LG

    Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life

    Authors: Jingyuan Xue, Longfei Wei, Dongjing Jiang, Fang Sheng, Russell Greiner, Jianfei Zhang

    Abstract: Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patte…

    Submitted 6 May, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

  24. arXiv:2503.12854  [pdf, other]

    cs.CL

    Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

    Authors: Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

    Abstract: Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effect…

    Submitted 27 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  25. arXiv:2503.12456  [pdf, ps, other]

    stat.ML cs.LG stat.AP stat.ME

    Nonlinear Principal Component Analysis with Random Bernoulli Features for Process Monitoring

    Authors: Ke Chen, Dandan Jiang

    Abstract: The process generates substantial amounts of data with highly complex structures, leading to the development of numerous nonlinear statistical methods. However, most of these methods rely on computations involving large-scale dense kernel matrices. This dependence poses significant challenges in meeting the high computational demands and real-time responsiveness required by online monitoring syste…

    Submitted 16 March, 2025; originally announced March 2025.

  26. arXiv:2503.12001  [pdf, other]

    cs.CV

    3D Gaussian Splatting against Moving Objects for High-Fidelity Street Scene Reconstruction

    Authors: Peizhen Zheng, Longfei Wei, Dongjing Jiang, Jianfei Zhang

    Abstract: The accurate reconstruction of dynamic street scenes is critical for applications in autonomous driving, augmented reality, and virtual reality. Traditional methods relying on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time processing constraints, limiting their effectiveness in complex urban environments. While multi-view stereo and neural radiance…

    Submitted 3 April, 2025; v1 submitted 15 March, 2025; originally announced March 2025.

  27. arXiv:2503.11946  [pdf, other]

    cs.DC

    CCRSat: A Collaborative Computation Reuse Framework for Satellite Edge Computing Networks

    Authors: Ye Zhang, Zhishu Shen, Dawen Jiang, Xiangrui Liu, Qiushi Zheng, Jiong Jin

    Abstract: In satellite computing applications, such as remote sensing, tasks often involve similar or identical input data, leading to the same processing results. Computation reuse is an emerging paradigm that leverages the execution results of previous tasks to enhance the utilization of computational resources. While this paradigm has been extensively studied in terrestrial networks with abundant computi…

    Submitted 14 March, 2025; originally announced March 2025.

  28. arXiv:2503.11251  [pdf, other]

    cs.CV cs.CL

    Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

    Authors: Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, et al. (29 additional authors not shown)

    Abstract: We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results de…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 7 pages

  29. arXiv:2503.10627  [pdf, other]

    cs.CV cs.AI cs.CL

    SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

    Authors: Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng

    Abstract: The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scient…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Initially released in September 2024. Project page: https://sciverse-cuhk.github.io

  30. arXiv:2503.10529  [pdf, other]

    cs.CV cs.AI

    PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

    Authors: Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li

    Abstract: 3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmen…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Technical Report

  31. arXiv:2503.04824  [pdf, other]

    cs.GR cs.AI cs.CV

    ProReflow: Progressive Reflow with Decomposed Velocity

    Authors: Lei Ke, Haohang Xu, Xuefei Ning, Yu Li, Jiajun Li, Haoling Li, Yuxuan Lin, Dongsheng Jiang, Yujiu Yang, Linfeng Zhang

    Abstract: Diffusion models have achieved significant progress in both image and video generation while still suffering from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process of diffusion models into a straight line for a few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not opti…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Our code will be released on GitHub

  32. arXiv:2503.04715  [pdf, other]

    cs.LG cs.AI

    Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

    Authors: Houyi Li, Wenzhen Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

    Abstract: The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationshi…

    Submitted 19 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: 22 pages

    ACM Class: F.2.2; I.2.7

  33. arXiv:2503.00697  [pdf, other]

    cs.CV cs.AI eess.IV

    CREATE-FFPE: Cross-Resolution Compensated and Multi-Frequency Enhanced FS-to-FFPE Stain Transfer for Intraoperative IHC Images

    Authors: Yiyang Lin, Danling Jiang, Xinyu Liu, Yun Miao, Yixuan Yuan

    Abstract: In the immunohistochemical (IHC) analysis during surgery, frozen-section (FS) images are used to determine the benignity or malignancy of the tumor. However, FS image faces problems such as image contamination and poor nuclear detail, which may disturb the pathologist's diagnosis. In contrast, formalin-fixed and paraffin-embedded (FFPE) image has a higher staining quality, but it requires quite a…

    Submitted 1 March, 2025; originally announced March 2025.

  34. arXiv:2502.19902  [pdf, other]

    cs.AI

    Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

    Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

    Abstract: Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Lan…

    Submitted 11 March, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

    Comments: Accepted to CVPR 2025. Project page: https://cybertronagent.github.io/Optimus-2.github.io/

  35. arXiv:2502.14430  [pdf, ps, other]

    cs.LG cs.CE

    Cardiac Evidence Backtracking for Eating Behavior Monitoring using Collocative Electrocardiogram Imagining

    Authors: Xu-Lu Zhang, Zhen-Qun Yang, Dong-Mei Jiang, Ga Liao, Qing Li, Ramesh Jain, Xiao-Yong Wei

    Abstract: Eating monitoring has remained an open challenge in medical research for years due to the lack of non-invasive sensors for continuous monitoring and the reliable methods for automatic behavior detection. In this paper, we present a pilot study using the wearable 24-hour ECG for sensing and tailoring the sophisticated deep learning for ad-hoc and interpretable detection. This is accomplished using…

    Submitted 20 February, 2025; originally announced February 2025.

  36. arXiv:2502.14096  [pdf, other]

    cs.LG math.OC

    Aligned Multi Objective Optimization

    Authors: Yonathan Efroni, Ben Kretzu, Daniel Jiang, Jalaj Bhandari, Zheqing Zhu, Karen Ullrich

    Abstract: To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLMs training show that diverse related tasks can enhance perf…

    Submitted 3 March, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

  37. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  38. arXiv:2502.10248  [pdf, other]

    cs.CV cs.CL

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, et al. (90 additional authors not shown)

    Abstract: We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded…

    Submitted 24 February, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

    Comments: 36 pages, 14 figures

  39. arXiv:2502.09621  [pdf, other]

    cs.CV cs.AI cs.CL

    MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

    Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li

    Abstract: Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR,…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: Project Page: https://mmecot.github.io/

  40. arXiv:2502.07590  [pdf, other]

    cs.DC cs.CV

    DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training

    Authors: Xin Tan, Yuetao Chen, Yimin Jiang, Xing Chen, Kun Yan, Nan Duan, Yibo Zhu, Daxin Jiang, Hong Xu

    Abstract: Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially with high-definition, lengthy videos, where it can consume up to 95% of processing time and demand specialized context parallelism. This paper introduces DSV to accelerate video DiT train…

    Submitted 16 March, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  41. arXiv:2502.03885  [pdf, other]

    cs.NI cs.DC cs.LG

    InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

    Authors: Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yimin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang

    Abstract: Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scalin…

    Submitted 7 May, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  42. arXiv:2502.01718  [pdf, other]

    cs.SE cs.AI cs.CL

    ACECODER: Acing Coder RL via Automated Test-Case Synthesis

    Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen

    Abstract: Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipe…

    Submitted 10 February, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: 9 pages, 1 figure, 8 tables

  43. arXiv:2501.15061  [pdf, other]

    cs.CV cs.AI

    PolaFormer: Polarity-aware Linear Attention for Vision Transformers

    Authors: Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, Zheng Zhang

    Abstract: Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less…

    Submitted 4 March, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

  44. arXiv:2501.13349  [pdf, other]

    cs.CV

    MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize

    Authors: Haohang Xu, Longyu Chen, Shuangrui Ding, Yilin Gao, Dongsheng Jiang, Yin Li, Shugong Xu, Junqing Yu, Wei Yang

    Abstract: Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decomp…

    Submitted 22 January, 2025; originally announced January 2025.

  45. arXiv:2501.11325  [pdf, other]

    cs.CV cs.AI

    CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

    Authors: Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang

    Abstract: Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-o…

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: 11 pages, 8 figures, 5 tables

    MSC Class: 68T42 (Primary) 168T45 (Secondary); ACM Class: I.4.9

  46. arXiv:2501.09804  [pdf, other]

    cs.LG cs.AI cs.CL

    Enhancing Generalization in Chain of Thought Reasoning for Smaller Models

    Authors: Maxwell J. Yin, Dingyi Jiang, Yongbing Chen, Boyu Wang, Charles Ling

    Abstract: Chain-of-Thought (CoT) reasoning in smaller language models is a challenging natural language processing problem yet highly desirable in many real-life applications. Existing CoT knowledge distillation methods often suffer from overly conservative memorization in smaller LLMs, leading to low generalization confidence. As fully preserving the CoT ability of the teacher model is impossible, we hypothesize…

    Submitted 16 January, 2025; originally announced January 2025.

  47. arXiv:2412.20631  [pdf, other]

    cs.CV

    Slow Perception: Let's Perceive Geometric Figures Step-by-step

    Authors: Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang

    Abstract: Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shap…

    Submitted 26 January, 2025; v1 submitted 29 December, 2024; originally announced December 2024.

  48. arXiv:2412.19855  [pdf, other]

    cs.GT

    Sychronous vs. asynchronous coalitions in multiplayer games, with applications to guts poker

    Authors: Jessica Babyak, Kevin Buck, Leah Dichter, David Jiang, Kevin Zumbrun

    Abstract: We study the issue introduced by Buck-Lee-Platnick-Wheeler-Zumbrun of synchronous vs. asynchronous coalitions in multiplayer games, that is, the difference between coalitions with full and partial communication, with a specific interest in the context of continuous Guts poker where this problem was originally formulated. We observe for general symmetric multiplayer games, with players 2-n in coali…

    Submitted 25 December, 2024; originally announced December 2024.

  49. arXiv:2412.19255  [pdf, other]

    cs.LG cs.CL

    Multi-matrix Factorization Attention

    Authors: Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang

    Abstract: We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention hea…

    Submitted 14 January, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

  50. arXiv:2412.18106  [pdf, other]

    cs.AI cs.DC cs.LG

    Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

    Authors: Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long

    Abstract: Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especial…

    Submitted 23 December, 2024; originally announced December 2024.