-
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Authors:
Hang Guo,
Yawei Li,
Taolin Zhang,
Jiangshan Wang,
Tao Dai,
Shu-Tao Xia,
Luca Benini
Abstract:
Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, causing complexity and runtime to scale dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step, where most tokens have already converged. Leveraging this observation, we develop a cached token pruning strategy that forwards only pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves efficiency at larger resolutions. Experiments show that FastVAR can further speed up FlashAttention-accelerated VAR by 2.7$\times$ with a negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher-resolution images. In particular, FastVAR can generate one 2K image with a 15GB memory footprint in 1.5s on a single NVIDIA 3090 GPU. Code is available at https://github.com/csguoh/FastVAR.
Submitted 6 April, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users
Authors:
Zhangcun Yan,
Jianqing Li,
Peng Hang,
Jian Sun
Abstract:
With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing the diversity and dynamics of VRU behaviors, making it difficult to meet the research demands of complex traffic environments. To address this gap, this study developed the OnSiteVRU datasets, which cover a variety of scenarios, including intersections, road segments, and urban villages. These datasets provide trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, totaling approximately 17,429 trajectories with a precision of 0.04 seconds. The datasets integrate both aerial-view natural driving data and onboard real-time dynamic detection data, along with environmental information such as traffic signals, obstacles, and real-time maps, enabling a comprehensive reconstruction of interaction events. The results demonstrate that VRU\_Data outperforms traditional datasets in terms of VRU density and scene coverage, offering a more comprehensive representation of VRU behavioral characteristics. This provides critical support for traffic flow modeling, trajectory prediction, and autonomous driving virtual testing. The dataset is publicly available for download at:
https://www.kaggle.com/datasets/zcyan2/mixed-traffic-trajectory-dataset-in-from-shanghai.
Submitted 30 March, 2025;
originally announced March 2025.
-
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Authors:
Jiahui Zhang,
Yurui Chen,
Yanpeng Zhou,
Yueming Xu,
Ze Huang,
Jilin Mei,
Junhui Chen,
Yu-Jie Yuan,
Xinyue Cai,
Guowei Huang,
Xingyue Quan,
Hang Xu,
Li Zhang
Abstract:
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
Submitted 27 May, 2025; v1 submitted 29 March, 2025;
originally announced March 2025.
-
A Pilot Study on Tunable Precision Emulation via Automatic BLAS Offloading
Authors:
Hang Liu,
Junjie Li,
Yinzhi Wang
Abstract:
This study explores the use of automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Through the use of low-bitwidth integer units and cache-coherent Unified Memory Architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both arithmetic precision and the properties of the operator, which can be addressed through tunable precision emulation. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization. We showcase the potential of improving accuracy and performance at the same time. This work highlights the potential of AI-driven hardware to transform HPC, advocating for adaptive precision strategies in future scientific computing.
Submitted 2 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Discussion of "Robust Distance Covariance" by S. Leyder, J. Raymaekers, and P.J. Rousseeuw
Authors:
Marc Hallin,
Davide La Vecchia,
Hang Liu,
Xinyi Xu
Abstract:
Distance covariance and distance correlation have long been regarded as natural measures of dependence between two random vectors, and have been used in a variety of situations for testing independence. Despite their popularity, the robustness of their empirical versions remains largely unexplored. The paper "Robust Distance Covariance" by S. Leyder, J. Raymaekers, and P.J. Rousseeuw (below referred to as [LRR]), which this article discusses, is a welcome addition to the literature. Among the intriguing results in [LRR], we are particularly interested in the so-called "robustness by transformation" principle, highlighted by their use of a clever trick named "the biloop transformation" to obtain a bounded and redescending influence function. Building on the measure-transportation-based notions of directional ranks and signs, we show how the "robustness via transformation" principle emphasized by [LRR] extends beyond the case of bivariate independence that [LRR] has investigated and also applies in higher-dimensional Euclidean spaces and on compact manifolds. The case of directional variables (taking values on (hyper)spheres) is given special attention.
Submitted 27 March, 2025;
originally announced March 2025.
-
BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
Authors:
Hang Zhou,
Xinxin Zuo,
Rui Ma,
Li Cheng
Abstract:
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervision. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
Submitted 27 March, 2025;
originally announced March 2025.
-
CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition
Authors:
Hanyu Liu,
Siyao Li,
Ying Yu,
Yixuan Jiang,
Hang Xiao,
Jingxi Long,
Haotian Tang
Abstract:
Human Activity Recognition (HAR) is a fundamental technology for numerous human-centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. This paper aims to address these issues in sensor-based human activity recognition. We propose a spatiotemporal attention modal decomposition alignment fusion strategy to tackle the problem of the mixed distribution of sensor data. Key discriminative features of activities are captured through cross-modal spatiotemporal disentangled representation, and gradient modulation is combined to alleviate data heterogeneity. In addition, a wearable deployment simulation system is constructed. We conducted experiments on a large number of public datasets, demonstrating the effectiveness of the model.
Submitted 27 March, 2025;
originally announced March 2025.
-
Low-Rank Adaptation of Pre-Trained Stable Diffusion for Rigid-Body Target ISAR Imaging
Authors:
Boan Zhang,
Hang Dong,
Jiongge Zhang,
Long Tian,
Rongrong Wang,
Zhenhua Wu,
Xiyang Liu,
Hongwei Liu
Abstract:
Traditional range-instantaneous Doppler (RID) methods for rigid-body target imaging often suffer from low resolution due to the limitations of time-frequency analysis (TFA). To address this challenge, our primary focus is on obtaining high-resolution time-frequency representations (TFRs) from their low-resolution counterparts. Recognizing that the curve features of TFRs are a specific type of texture feature, we argue that pre-trained generative models such as Stable Diffusion (SD) are well suited for enhancing TFRs, thanks to their powerful capability in capturing texture representations. Building on this insight, we propose a novel inverse synthetic aperture radar (ISAR) imaging method for rigid-body targets, leveraging the low-rank adaptation (LoRA) of a pre-trained SD model. Our approach adopts the basic structure and pre-trained parameters of SD Turbo while incorporating additional linear operations for LoRA and adversarial training to achieve super-resolution and noise suppression. Then we integrate LoRA-SD into the RID-based ISAR imaging, enabling sharply focused and denoised imaging with super-resolution capabilities. We evaluate our method using both simulated and real radar data. The experimental results demonstrate the superiority of our approach in frequency estimation and ISAR imaging compared to traditional methods. Notably, the generalization capability is verified by training on simulated radar data and testing on measured radar data.
Submitted 26 March, 2025;
originally announced March 2025.
-
ImF: Implicit Fingerprint for Large Language Models
Authors:
Wu Jiaxuan,
Peng Wanli,
Fu Hang,
Xue Yiming,
Wen Juan
Abstract:
Training large language models (LLMs) is resource-intensive and expensive, making protecting intellectual property (IP) for LLMs crucial. Recently, embedding fingerprints into LLMs has emerged as a prevalent method for establishing model ownership. However, existing fingerprinting techniques typically embed identifiable patterns with weak semantic coherence, resulting in fingerprints that significantly differ from the natural question-answering (QA) behavior inherent to LLMs. This discrepancy undermines the stealthiness of the embedded fingerprints and makes them vulnerable to adversarial attacks. In this paper, we first demonstrate the critical vulnerability of existing fingerprint embedding methods by introducing a novel adversarial attack named Generation Revision Intervention (GRI) attack. GRI attack exploits the semantic fragility of current fingerprinting methods, effectively erasing fingerprints by disrupting their weakly correlated semantic structures. Our empirical evaluation highlights that traditional fingerprinting approaches are significantly compromised by the GRI attack, revealing severe limitations in their robustness under realistic adversarial conditions. To advance the state-of-the-art in model fingerprinting, we propose a novel model fingerprint paradigm called Implicit Fingerprints (ImF). ImF leverages steganography techniques to subtly embed ownership information within natural texts, subsequently using Chain-of-Thought (CoT) prompting to construct semantically coherent and contextually natural QA pairs. This design ensures that fingerprints seamlessly integrate with the standard model behavior, remaining indistinguishable from regular outputs and substantially reducing the risk of accidental triggering and targeted removal. We conduct a comprehensive evaluation of ImF on 15 diverse LLMs, spanning different architectures and varying scales.
Submitted 17 May, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Authors:
Wenqi Zhang,
Mengna Wang,
Gangao Liu,
Xu Huixin,
Yiwei Jiang,
Yongliang Shen,
Guiyang Hou,
Zhe Zheng,
Hang Zhang,
Xin Li,
Weiming Lu,
Peng Li,
Yueting Zhuang
Abstract:
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world experiments also show its superiority, with fewer repeated searches and logical inconsistencies.
Submitted 14 May, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Authors:
Haoyu Zhao,
Zhongang Qi,
Cong Wang,
Qingping Zheng,
Guansong Lu,
Fei Chen,
Hang Xu,
Zuxuan Wu
Abstract:
With diffusion transformer (DiT) excelling in video generation, its use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degradation. To address these issues, we propose DynamiCtrl, a novel framework for human animation in video DiT architecture. Specifically, we use a shared VAE encoder for human images and driving poses, unifying them into a common latent space, maintaining pose fidelity, and eliminating the need for an expert pose encoder during video denoising. To integrate pose control into the DiT backbone effectively, we propose a novel Pose-adaptive Layer Norm model. It injects normalized pose features into the denoising process via conditioning on visual tokens, enabling seamless and scalable pose control across DiT blocks. Furthermore, to overcome the shortcomings of text removal, we introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. Through full-attention blocks, image and pose features are aligned with text features, enhancing semantic consistency, leveraging pretrained knowledge, and enabling multi-level control. Experiments verify the superiority of DynamiCtrl on benchmark and self-collected data (e.g., achieving the best LPIPS of 0.166), demonstrating strong character control and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.
Submitted 18 May, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection
Authors:
Yun Zhu,
Le Hui,
Hang Yang,
Jianjun Qian,
Jin Xie,
Jian Yang
Abstract:
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.
Submitted 13 June, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
Authors:
Jiazhi Guan,
Kaisiyuan Wang,
Zhiliang Xu,
Quanwei Yang,
Yasheng Sun,
Shengyi He,
Borong Liang,
Yukang Cao,
Yingying Li,
Haocheng Feng,
Errui Ding,
Jingdong Wang,
Youjian Zhao,
Hang Zhou,
Ziwei Liu
Abstract:
Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascaded Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at https://guanjz20.github.io/projects/AudCast.
Submitted 25 March, 2025;
originally announced March 2025.
-
Towards Reliable Time Series Forecasting under Future Uncertainty: Ambiguity and Novelty Rejection Mechanisms
Authors:
Ninghui Feng,
Songning Lai,
Xin Zhou,
Jiayu Yang,
Kunlong Feng,
Zhenxiao Yin,
Fobao Zhou,
Zhangyi Hu,
Yutao Yue,
Yuxuan Liang,
Boyu Wang,
Hang Zhao
Abstract:
In real-world time series forecasting, uncertainty and lack of reliable evaluation pose significant challenges. Notably, forecasting errors often arise from underfitting in-distribution data and failing to handle out-of-distribution inputs. To enhance model reliability, we introduce a dual rejection mechanism combining ambiguity and novelty rejection. Ambiguity rejection, using prediction error variance, allows the model to abstain under low confidence, assessed through historical error variance analysis without future ground truth. Novelty rejection, employing Variational Autoencoders and Mahalanobis distance, detects deviations from training data. This dual approach improves forecasting reliability in dynamic environments by reducing errors and adapting to data changes, advancing reliability in complex scenarios.
Submitted 25 March, 2025;
originally announced March 2025.
-
Understanding and Improving Information Preservation in Prompt Compression for LLMs
Authors:
Weronika Łajewska,
Momchil Hardalov,
Laura Aina,
Neha Anna John,
Hang Su,
Lluís Màrquez
Abstract:
Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to better control the granularity of the compressed information can significantly improve their effectiveness: up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.
Submitted 24 March, 2025;
originally announced March 2025.
-
RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation
Authors:
Chengbo Yuan,
Suraj Joshi,
Shaoting Zhu,
Hang Su,
Hang Zhao,
Yang Gao
Abstract:
Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate the ability to generalize robot manipulation tasks across six entirely new scenes, based solely on demonstrations collected from a single scene, achieving a more than 200% performance improvement compared to the no-augmentation baseline. All datasets, model weights, and the toolkit will be publicly released.
Submitted 24 March, 2025;
originally announced March 2025.
-
Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
Authors:
Chi-Ning Chou,
Hang Le,
Yichen Wang,
SueYeon Chung
Abstract:
The ability to integrate task-relevant information into neural representations is a fundamental aspect of both biological and artificial intelligence. To enable theoretical analysis, recent work has examined whether a network learns task-relevant features (rich learning) or resembles a random feature model (or a kernel machine, i.e., lazy learning). However, this simple lazy-versus-rich dichotomy overlooks the possibility of various subtypes of feature learning that emerge from different architectures, learning rules, and data properties. Furthermore, most existing approaches emphasize weight matrices or neural tangent kernels, limiting their applicability to neuroscience because they do not explicitly characterize representations.
In this work, we introduce an analysis framework based on representational geometry to study feature learning. Instead of analyzing what the learned features are, we focus on characterizing how task-relevant representational manifolds evolve during the learning process. In both theory and experiment, we find that when a network learns features useful for solving a task, the task-relevant manifolds become increasingly untangled. Moreover, by tracking changes in the underlying manifold geometry, we uncover distinct learning stages throughout training, as well as different learning strategies associated with training hyperparameters, revealing subtypes of feature learning beyond the lazy-versus-rich dichotomy. Applying our method to neuroscience and machine learning, we gain geometric insights into the structural inductive biases of neural circuits solving cognitive tasks and the mechanisms underlying out-of-distribution generalization in image classification. Our framework provides a novel geometric perspective for understanding and quantifying feature learning in both artificial and biological neural networks.
Submitted 23 March, 2025;
originally announced March 2025.
-
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Authors:
Ziming Wei,
Bingqian Lin,
Yunshuang Nie,
Jiaqi Chen,
Shikui Ma,
Hang Xu,
Xiaodan Liang
Abstract:
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
Submitted 23 March, 2025;
originally announced March 2025.
-
Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
Authors:
Codefuse,
Ling Team,
Wenting Cai,
Yuchen Cao,
Chaoyu Chen,
Chen Chen,
Siba Chen,
Qing Cui,
Peng Di,
Junpeng Fang,
Zi Gong,
Ting Guo,
Zhengyu He,
Yang Huang,
Cong Li,
Jianguo Li,
Zheng Li,
Shijie Lian,
BingChang Liu,
Songshan Luo,
Shuo Mao,
Min Shen,
Jian Wu,
Jiaolong Yang
, et al. (8 additional authors not shown)
Abstract:
Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.
Submitted 22 March, 2025;
originally announced March 2025.
-
EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision
Authors:
Xiaofeng Mao,
Yuefeng Chen,
Rong Zhang,
Hui Xue,
Zhao Li,
Hang Su
Abstract:
Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs is not as robust as human perception. Adversarial attacks and data distribution shifts are known as two major scenarios that degrade machine performance and obstruct the wide deployment of machines "in the wild". To overcome these obstacles and facilitate research on model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluation and analysis of robust vision models. EasyRobust targets two types of robustness: 1) Adversarial robustness enables the model to defend against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) Non-adversarial robustness enhances the model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation of vision models. We hope EasyRobust can help in training practically robust models and promote academic and industrial progress in closing the gap between human and machine vision. Code and models of EasyRobust have been open-sourced at https://github.com/alibaba/easyrobust.
Submitted 21 March, 2025;
originally announced March 2025.
-
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
Authors:
Yingying Fan,
Quanwei Yang,
Kaisiyuan Wang,
Hang Zhou,
Yingying Li,
Haocheng Feng,
Errui Ding,
Yu Wu,
Jingdong Wang
Abstract:
Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To tackle these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout adjustment strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: https://fyycs.github.io/Re-HOLD.
Submitted 25 March, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
A Vehicle-Infrastructure Multi-layer Cooperative Decision-making Framework
Authors:
Yiming Cui,
Shiyu Fang,
Peng Hang,
Jian Sun
Abstract:
Autonomous driving has entered the testing phase, but due to the limited decision-making capabilities of individual vehicle algorithms, safety and efficiency issues have become more apparent in complex scenarios. With the advancement of connected communication technologies, autonomous vehicles equipped with connectivity can leverage vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, offering a potential solution to the decision-making challenges that arise from an individual vehicle's perspective. We propose a multi-level vehicle-infrastructure cooperative decision-making framework for complex conflict scenarios at unsignalized intersections. First, based on vehicle states, we define a method for quantifying vehicle impacts and their propagation relationships, using accumulated impact to group vehicles through motif-based graph clustering. Next, within and between vehicle groups, a pass order negotiation process based on Large Language Models (LLMs) is employed to determine the vehicle passage order, resulting in planned vehicle actions. Simulation results from ablation experiments show that our approach reduces negotiation complexity and ensures safer, more efficient vehicle passage at intersections, aligning with natural decision-making logic.
Submitted 19 March, 2025;
originally announced March 2025.
-
Practical Portfolio Optimization with Metaheuristics: Pre-assignment Constraint and Margin Trading
Authors:
Hang Kin Poon
Abstract:
Portfolio optimization is a critical area in finance, aiming to maximize returns while minimizing risk. Metaheuristic algorithms have been shown to solve complex optimization problems efficiently, with Genetic Algorithms and Particle Swarm Optimization being among the most popular methods. This paper introduces an innovative approach to portfolio optimization that incorporates pre-assignment to limit the search space according to investor preferences and to obtain better results. Additionally, it takes margin trading strategies into account and uses a rare performance ratio to evaluate portfolio efficiency. Through an illustrative example, this paper demonstrates that the metaheuristic-based methodology yields superior risk-adjusted returns compared to traditional benchmarks. The results highlight the potential of metaheuristics, with the help of asset filtering, to enhance portfolio performance in terms of risk-adjusted return.
Submitted 20 March, 2025;
originally announced March 2025.
-
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Authors:
Zihao Zhang,
Haoran Chen,
Haoyu Zhao,
Guansong Lu,
Yanwei Fu,
Hang Xu,
Zuxuan Wu
Abstract:
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
Submitted 9 May, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems
Authors:
Yuelyu Ji,
Hang Zhang,
Yanshan Wang
Abstract:
Medical Question Answering systems based on Retrieval Augmented Generation (RAG) are promising for clinical decision support because they can integrate external knowledge, thus reducing inaccuracies inherent in standalone large language models (LLMs). However, these systems may unintentionally propagate or amplify biases associated with sensitive demographic attributes like race, gender, and socioeconomic factors. This study systematically evaluates demographic biases within medical RAG pipelines across multiple QA benchmarks, including MedQA, MedMCQA, MMLU, and EquityMedQA. We quantify disparities in retrieval consistency and answer correctness by generating and analyzing queries sensitive to demographic variations. We further implement and compare several bias mitigation strategies to address identified biases, including Chain of Thought reasoning, Counterfactual filtering, Adversarial prompt refinement, and Majority Vote aggregation. Experimental results reveal significant demographic disparities, highlighting that Majority Vote aggregation notably improves accuracy and fairness metrics. Our findings underscore the critical need for explicitly fairness-aware retrieval methods and prompt engineering strategies to develop truly equitable medical QA systems.
Submitted 26 March, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Pseudo Relevance Feedback is Enough to Close the Gap Between Small and Large Dense Retrieval Models
Authors:
Hang Li,
Xiao Wang,
Bevan Koopman,
Guido Zuccon
Abstract:
Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. However, this has substantial cost implications: larger backbones require more expensive hardware (e.g. GPUs with more memory) and lead to higher indexing and querying costs (latency, energy consumption). In this paper, we challenge this paradigm by introducing PromptPRF, a feature-based pseudo-relevance feedback (PRF) framework that enables small LLM-based dense retrievers to achieve effectiveness comparable to much larger models.
PromptPRF uses LLMs to extract query-independent, structured and unstructured features (e.g., entities, summaries, chain-of-thought keywords, essay) from top-ranked documents. These features are generated offline and integrated into dense query representations via prompting, enabling efficient retrieval without additional training. Unlike prior methods such as GRF, which rely on online, query-specific generation and sparse retrieval, PromptPRF decouples feedback generation from query processing and supports dense retrievers in a fully zero-shot setting.
Experiments on TREC DL and BEIR benchmarks demonstrate that PromptPRF consistently improves retrieval effectiveness and offers favourable cost-effectiveness trade-offs. We further present ablation studies to understand the role of positional feedback and analyse the interplay between feature extractor size, PRF depth, and model performance. Our findings demonstrate that with effective PRF design, scaling the retriever is not always necessary, narrowing the gap between small and large models while reducing inference cost.
Submitted 5 June, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
ARC-Calib: Autonomous Markerless Camera-to-Robot Calibration via Exploratory Robot Motions
Authors:
Podshara Chanrungmaneekul,
Yiting Chen,
Joshua T. Grace,
Aaron M. Dollar,
Kaiyu Hang
Abstract:
Camera-to-robot (also known as eye-to-hand) calibration is a critical component of vision-based robot manipulation. Traditional marker-based methods often require human intervention for system setup. Furthermore, existing autonomous markerless calibration methods typically rely on pre-trained robot tracking models that impede their application on edge devices and require fine-tuning for novel robot embodiments. To address these limitations, this paper proposes a model-based markerless camera-to-robot calibration framework, ARC-Calib, that is fully autonomous and generalizable across diverse robots and scenarios without requiring extensive data collection or learning. First, exploratory robot motions are introduced to generate easily trackable trajectory-based visual patterns in the camera's image frames. Then, a geometric optimization framework is proposed to exploit the coplanarity and collinearity constraints from the observed motions to iteratively refine the estimated calibration result. Our approach eliminates the need for extra effort in either environmental marker setup or data collection and model training, rendering it highly adaptable across a wide range of real-world autonomous systems. Extensive experiments are conducted in both simulation and the real world to validate its robustness and generalizability.
Submitted 18 March, 2025;
originally announced March 2025.
-
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Authors:
Jensen Zhou,
Hang Gao,
Vikram Voleti,
Aaryaman Vasishta,
Chun-Han Yao,
Mark Boss,
Philip Torr,
Christian Rupprecht,
Varun Jampani
Abstract:
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. Project page with code and model: https://stable-virtual-camera.github.io/.
Submitted 1 April, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Authors:
Qiying Yu,
Zheng Zhang,
Ruofei Zhu,
Yufeng Yuan,
Xiaochen Zuo,
Yu Yue,
Weinan Dai,
Tiantian Fan,
Gaohong Liu,
Lingjun Liu,
Xin Liu,
Haibin Lin,
Zhiqi Lin,
Bole Ma,
Guangming Sheng,
Yuxuan Tong,
Chi Zhang,
Mofan Zhang,
Wang Zhang,
Hang Zhu,
Jinhua Zhu,
Jiaze Chen,
Jiangjie Chen,
Chengyi Wang,
Hongli Yu
, et al. (10 additional authors not shown)
Abstract:
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
Submitted 19 May, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Electromagnetic Duality Symmetry-Protected Dirac-Like Cones
Authors:
Muxuan Yang,
Dongyang Yan,
Lei Gao,
Wei Liu,
Yun Lai,
Yadong Xu,
Zhi Hong Hang,
Jie Luo
Abstract:
Dirac-like cones, featuring conical linear dispersions intersecting with flat bands, typically arise from accidental degeneracy of multiple modes that requires precise tuning of material and structural parameters, inherently limiting their robustness and applications. In this work, by introducing electromagnetic duality symmetry into photonic crystals, we demonstrate the emergence of intrinsically robust deterministic Dirac-like cones. We show that such symmetry (achieved through either self-dual particles or non-self-dual particle clusters with duality-glide symmetry) enforces double degeneracies for band structures of photonic crystals. Furthermore, by harnessing the joint duality-structural symmetry, multiple deterministic Dirac-like cones exhibiting exceptional resilience to lattice size variations can be obtained. Our introduction of an extra symmetry into photonic crystals establishes a profound connection between duality symmetry and Dirac physics, providing a robust platform for advanced photonic band engineering.
Submitted 18 March, 2025;
originally announced March 2025.
-
Multi-Modal Self-Supervised Semantic Communication
Authors:
Hang Zhao,
Hongru Li,
Dongfang Xu,
Shenghui Song,
Khaled B. Letaief
Abstract:
Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.
Submitted 18 March, 2025;
originally announced March 2025.
-
$φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Authors:
Fangzhi Xu,
Hang Yan,
Chang Ma,
Haiteng Zhao,
Jun Liu,
Qika Lin,
Zhiyong Wu
Abstract:
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $φ$-Decoding. To provide a precise and expressive estimation of step value, $φ$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $φ$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
Submitted 17 March, 2025;
originally announced March 2025.
-
Rapfi: Distilling Efficient Neural Network for the Game of Gomoku
Authors:
Zhanggen Jin,
Haobin Duan,
Zhiyang Hang
Abstract:
Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural-network-based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a compact neural network with a pattern-based codebook distilled from CNNs, and an incremental update scheme that minimizes computation when input changes are minor. This new network needs orders of magnitude less computation to reach accuracy similar to that of much larger neural networks such as ResNet. Thanks to our incremental update scheme, depth-first search methods such as alpha-beta search can be significantly accelerated. With a carefully tuned evaluation and search, Rapfi reached strength surpassing Katagomo, the strongest open-source Gomoku AI based on AlphaZero's algorithm, under limited computational resources where accelerators like GPUs are absent. Rapfi ranked first among 520 Gomoku agents on Botzone and won the championship in GomoCup 2024.
Submitted 17 March, 2025;
originally announced March 2025.
-
HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models
Authors:
Xinyan Jiang,
Hang Ye,
Yongxin Zhu,
Xiaoying Zheng,
Zikang Chen,
Jun Gong
Abstract:
Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more "contrast-effective" hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.
Submitted 23 May, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
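The contrast step described in the HICD abstract above (comparing hallucinated outputs with the original outputs) can be illustrated with a generic contrastive-decoding rule. This is only a sketch of a common formulation, not HICD's published scoring function: the name `contrastive_logits`, the weight `alpha`, and the toy logits are assumptions made for illustration.

```python
def contrastive_logits(original, induced, alpha=1.0):
    """Amplify what the original pass prefers over the hallucination-induced pass.

    Generic rule (an assumption, not HICD's exact formula):
        score = (1 + alpha) * original - alpha * induced
    """
    if len(original) != len(induced):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [(1.0 + alpha) * o - alpha * h for o, h in zip(original, induced)]


def pick_token(scores):
    """Greedy choice: index of the highest contrastive score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy 3-token vocabulary: the induced pass is overconfident in token 0,
# so the contrast shifts the choice towards token 1.
scores = contrastive_logits([2.0, 1.8, 0.1], [3.0, 0.5, 0.1], alpha=1.0)
chosen = pick_token(scores)
```

With `alpha = 0` the rule reduces to ordinary greedy decoding on the original logits; larger values penalize tokens that the induced pass favors more strongly.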
-
ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
Authors:
Liangyu Wang,
Jie Ren,
Hang Xu,
Junxiao Wang,
Huanyi Xie,
David E. Keyes,
Di Wang
Abstract:
Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.
Submitted 16 March, 2025;
originally announced March 2025.
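To make the zeroth-order idea in the ZO2 abstract above concrete (gradients from forward passes only, with two forward evaluations per step), here is a minimal two-point, SPSA-style estimator. It is not taken from the ZO2 code base: `sample_direction`, `zo_gradient`, the seed-based regeneration of the perturbation, and the toy quadratic loss are all assumptions made for illustration.

```python
import random


def sample_direction(dim, seed):
    """Regenerate the same Gaussian perturbation from a seed (hypothetical helper).

    Storing only the seed avoids keeping a second parameter-sized buffer.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def zo_gradient(loss, params, seed, eps=1e-3):
    """Two-point zeroth-order estimate: two forward passes, no backward pass."""
    z = sample_direction(len(params), seed)
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def quadratic(params):
    """Toy loss standing in for a model's forward pass."""
    return sum(p * p for p in params)


grad_estimate = zo_gradient(quadratic, [1.0, 2.0], seed=0)
```

For the quadratic toy loss the finite difference is exact up to rounding, so the estimate equals `2 * dot(params, z) * z`; averaged over many random directions this approaches the true gradient `2 * params`.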
-
Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Authors:
Junjie Chen,
Xuyang Liu,
Subin Huang,
Linfeng Zhang,
Hang Yu
Abstract:
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
Submitted 15 March, 2025;
originally announced March 2025.
-
Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning
Authors:
Hang Ni,
Jindong Han,
Nengjun Zhu,
Hao Liu
Abstract:
Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approaches have demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals. However, existing GNN-based approaches implicitly follow the homophily principle (i.e., the "like attracts like" phenomenon) and fail to learn discriminative embeddings for anomalies that connect to many normal nodes. Moreover, such approaches identify anomalies from a unified global perspective but overlook diversified abnormal patterns conditioned on local graph context, leading to suboptimal performance. To overcome these limitations, we propose a Multi-hypersphere Heterophilic Graph Learning (MHetGL) framework for unsupervised GAD. Specifically, we first devise a Heterophilic Graph Encoding (HGE) module to learn distinguishable representations for potential anomalies by purifying and augmenting their neighborhood in a fully unsupervised manner. Then, we propose a Multi-Hypersphere Learning (MHL) module to enhance the detection capability for context-dependent anomalies by jointly incorporating critical patterns from both global and local perspectives. Extensive experiments on ten real-world datasets show that MHetGL outperforms 14 baselines. Our code is publicly available at https://github.com/KennyNH/MHetGL.
Submitted 15 March, 2025;
originally announced March 2025.
-
Local controllability of a free-boundary problem for 1D degenerate parabolic equations
Authors:
Lingyang Liu,
Hang Gao
Abstract:
This paper deals with the local controllability of a free-boundary problem for the 1D boundary-degenerate parabolic equation with distributed controls, locally supported in space. We prove that, if the final time T is fixed and the initial state is sufficiently small, there exist controls that drive the state exactly to rest at time t = T. The proof is based on Schauder's fixed point theorem, combined with appropriate estimates for solutions to degenerate parabolic equations and for the control function.
Submitted 3 May, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Resonance locking: radian-level phase shifts due to nonlinear hydrodynamics of $g$-modes in merging neutron star binaries
Authors:
K. J. Kwon,
Hang Yu,
Tejaswi Venumadhav
Abstract:
A neutron star (NS) in a binary system deforms due to the companion's tidal gravitational field. As the binary inspirals due to gravitational wave (GW) emission, the NS's deformation evolves; this evolution is typically modeled as the star's linear response to the companion's time-evolving tidal potential. In principle, the fluid elements' displacements can be excited and evolve nonlinearly since the equations of hydrodynamics and the tidal forcing have nonlinear terms. Recently, Kwon, Yu, and Venumadhav (KYV I [arXiv:2410.03831]) showed that nonlinear terms in the hydrodynamic equations of motion make the low-frequency response of NSs, characterized by gravity ($g$-) modes, behave in an anharmonic manner. The anharmonicity is dominantly generated by the mutual coupling of the four lowest-order ($n=1$, $l=|m|=2$) $g$-modes, and allows them to stay locked in a resonant state that oscillates phase-coherently with the orbit throughout the inspiral. As a result, the $g$-modes grow to larger amplitudes than the linear response suggests, leading to an extra phase correction to the frequency-domain GW signal $|ΔΨ|\approx 3\,{\rm rad}$ at a GW frequency of $1.05\,{\rm kHz}$. This effect is part of the truly dynamical tide, in the sense that the amplitude depends not just on the binary's instantaneous frequency but the entire history of the inspiral. In this paper, we explain the phenomenology of resonance locking in detail and analytically validate the numerical dephasing calculations in KYV I. We also demonstrate that the effect is only significant for the lowest-order $g$-modes.
Submitted 14 March, 2025;
originally announced March 2025.
-
Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios
Authors:
Hang Shao,
Lei Luo,
Jianjun Qian,
Mengkai Yan,
Shuo Chen,
Jian Yang
Abstract:
Physiological activities can manifest as subtle changes in facial imaging. While these changes are barely observable to our eyes, computer vision methods can detect them, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal light conditions but perform poorly in the wild, with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and it is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.
Submitted 14 March, 2025;
originally announced March 2025.
-
Quantum ensemble learning with a programmable superconducting processor
Authors:
Jiachen Chen,
Yaozu Wu,
Zhen Yang,
Shibo Xu,
Xuan Ye,
Daili Li,
Ke Wang,
Chuanyu Zhang,
Feitong Jin,
Xuhao Zhu,
Yu Gao,
Ziqi Tan,
Zhengyi Cui,
Aosai Zhang,
Ning Wang,
Yiren Zou,
Tingting Li,
Fanhao Shen,
Jiarun Zhong,
Zehang Bao,
Zitian Zhu,
Zixuan Song,
Jinfeng Deng,
Hang Dong,
Pengfei Zhang
, et al. (8 additional authors not shown)
Abstract:
Quantum machine learning is among the most exciting potential applications of quantum computing. However, the vulnerability of quantum information to environmental noise and the consequent high cost of realizing fault tolerance have impeded quantum models from learning complex datasets. Here, we introduce AdaBoost.Q, a quantum adaptation of the classical adaptive boosting (AdaBoost) algorithm designed to enhance learning capabilities of quantum classifiers. Based on the probabilistic nature of quantum measurement, the algorithm improves the prediction accuracy by refining the attention mechanism during the adaptive training and combination of quantum classifiers. We experimentally demonstrate the versatility of our approach on a programmable superconducting processor, where we observe notable performance enhancements across various quantum machine learning models, including quantum neural networks and quantum convolutional neural networks. With AdaBoost.Q, we achieve an accuracy above 86% for a ten-class classification task over 10,000 test samples, and an accuracy of 100% for a quantum feature recognition task over 1,564 test samples. Our results demonstrate a foundational tool for advancing quantum machine learning towards practical applications, which has broad applicability to both the current noisy and the future fault-tolerant quantum devices.
Submitted 13 March, 2025;
originally announced March 2025.
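For readers unfamiliar with the classical algorithm that AdaBoost.Q adapts, the sketch below shows one round of ordinary binary AdaBoost: compute the weighted error of a weak classifier, derive its vote, and reweight the samples. It covers only the classical two-class case, not the quantum variant or the ten-class task from the abstract; the function name `adaboost_round` and the toy labels are assumptions made for illustration.

```python
import math


def adaboost_round(weights, labels, preds):
    """One round of classical binary AdaBoost.

    weights: sample weights summing to 1
    labels, preds: true labels and weak-classifier outputs, each +1 or -1
    Returns the classifier's vote alpha and the renormalized sample weights.
    """
    # Weighted error of the weak classifier.
    err = sum(w for w, y, h in zip(weights, labels, preds) if y != h)
    # Clamp to keep the logarithm finite for perfect or useless classifiers.
    err = min(max(err, 1e-12), 1.0 - 1e-12)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (y * h == -1) gain weight, correct ones lose it.
    raw = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, labels, preds)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Four samples, uniform weights, the last one misclassified.
alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, 1, -1, 1]
)
```

After the update the misclassified sample carries half of the total weight, which is the usual AdaBoost property that forces the next weak classifier to focus on previous mistakes.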
-
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Authors:
Hang Yin,
Xiuwei Xu,
Lingqing Zhao,
Ziwei Wang,
Jie Zhou,
Jiwen Lu
Abstract:
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference frameworks upon large language models (LLMs) for specific tasks; these frameworks differ considerably in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLMs for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate long-term exploration goals according to different matching states. When zero-matched, the agent first iteratively searches for a subgraph of the goal. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
Submitted 18 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback
Authors:
Derun Li,
Jianwei Ren,
Yue Wang,
Xin Wen,
Pengxiang Li,
Leimeng Xu,
Kun Zhan,
Zhongpu Xia,
Peng Jia,
Xianpeng Lang,
Ningyi Xu,
Hang Zhao
Abstract:
Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of human driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models, designed to align motion planning with diverse driving preferences. TrajHF incorporates multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves PDMS of 93.95 on NavSim benchmark, significantly exceeding other methods. TrajHF sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
Submitted 13 March, 2025;
originally announced March 2025.
-
A filtered Lie splitting method for the Zakharov system with low regularity estimates
Authors:
Lun Ji,
Hang Li,
Chunmei Su
Abstract:
In this paper, we present an error estimate for the filtered Lie splitting scheme applied to the Zakharov system, characterized by solutions exhibiting very low regularity across all dimensions. Our findings are derived from the application of multilinear estimates established within the framework of discrete Bourgain spaces. Specifically, we demonstrate that when the solution $(E,z,z_t) \in H^{s+r+1/2}\times H^{s+r}\times H^{s+r-1}$, the error in $H^{r+1/2}\times H^{r}\times H^{r-1}$ is $\mathcal{O}(τ^{s/2})$ for $s\in(0,2]$, where $r=\max(0,\frac d2-1)$. To the best of our knowledge, this represents the first explicit error estimate for the splitting method based on the original Zakharov system, as well as the first instance where low regularity error estimates for coupled equations have been considered within the Bourgain framework. Furthermore, numerical experiments confirm the validity of our theoretical results.
Submitted 13 March, 2025;
originally announced March 2025.
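As a worked instance of the error estimate quoted in the abstract above, the indices can be evaluated for a concrete case. The choice $d=3$, $s=1$ is only an illustrative assumption; the general statement is the one given in the abstract.

```latex
r = \max\!\left(0, \tfrac{d}{2} - 1\right) =
\begin{cases}
0, & d = 1, 2, \\
\tfrac{1}{2}, & d = 3,
\end{cases}
\qquad
\text{and for } d = 3,\ s = 1:
\quad
(E, z, z_t) \in H^{2} \times H^{3/2} \times H^{1/2}
\;\Longrightarrow\;
\text{error in } H^{1} \times H^{1/2} \times H^{-1/2}
\text{ is } \mathcal{O}\!\left(\tau^{1/2}\right).
```

At the upper end of the admissible range, $s=2$, the same formula gives first-order convergence, $\mathcal{O}(\tau)$.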
-
Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers
Authors:
Yasheng Sun,
Zhiliang Xu,
Hang Zhou,
Jiazhi Guan,
Quanwei Yang,
Kaisiyuan Wang,
Borong Liang,
Yingying Li,
Haocheng Feng,
Jingdong Wang,
Ziwei Liu,
Koike Hideki
Abstract:
Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
Submitted 12 March, 2025;
originally announced March 2025.
-
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
Authors:
Xiangyu Peng,
Zangwei Zheng,
Chenhui Shen,
Tom Young,
Xinying Guo,
Binluo Wang,
Hang Xu,
Hongxin Liu,
Mingyan Jiang,
Wenjun Li,
Yuhui Wang,
Anbang Ye,
Gang Ren,
Qianran Ma,
Wanying Liang,
Xiang Lian,
Xiwen Wu,
Yuting Zhong,
Zhuangyan Li,
Chaoyu Gong,
Guojun Lei,
Leijun Cheng,
Limin Zhang,
Minghao Li,
Ruijie Zhang
, et al. (7 additional authors not shown)
Abstract:
Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. According to human evaluation results and VBench scores, Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. By making Open-Sora 2.0 fully open-source, we aim to democratize access to advanced video generation technology, fostering broader innovation and creativity in content creation. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
Submitted 23 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
MoE-Loco: Mixture of Experts for Multitask Locomotion
Authors:
Runhan Huang,
Shaoting Zhu,
Yilun Du,
Hang Zhao
Abstract:
We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.
Submitted 20 May, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
TrackOcc: Camera-based 4D Panoptic Occupancy Tracking
Authors:
Zhuoguang Chen,
Kenan Li,
Xiuyu Yang,
Tao Jiang,
Yiming Li,
Hang Zhao
Abstract:
Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.
Submitted 11 March, 2025;
originally announced March 2025.
-
FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
Authors:
Jianian Zhu,
Hang Wu,
Haojie Wang,
Yinghui Li,
Biao Hou,
Ruixuan Li,
Jidong Zhai
Abstract:
Multi-modal Large Language Model (MLLM) serving systems commonly employ KV-cache compression to reduce memory footprint. However, existing compression methods introduce significant processing overhead and queuing delays, particularly in concurrent serving scenarios. We present FastCache, a novel serving framework that effectively addresses these challenges through two key innovations: (1) a dynamic batching strategy that optimizes request scheduling across prefill, compression, and decode stages, and (2) an efficient KV-cache memory pool mechanism that eliminates memory fragmentation while maintaining high GPU utilization. Our comprehensive experiments on the GQA and MileBench datasets demonstrate that FastCache achieves up to 19.3$\times$ reduction in Time-To-First-Token (TTFT) and 12.1$\times$ improvement in throughput compared to state-of-the-art baselines. The system maintains stable performance under high-concurrency scenarios (up to 40 req/s) while reducing average memory consumption by 20%. These results establish FastCache as an efficient solution for real-world LLM serving systems with KV-cache compression.
Submitted 11 March, 2025;
originally announced March 2025.
-
Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction
Authors:
Zongzheng Zhang,
Xinrun Li,
Sizhe Zou,
Guoxuan Chi,
Siqi Li,
Xuchong Qiu,
Guoliang Wang,
Guantian Zheng,
Leichen Wang,
Hang Zhao,
Hao Zhao
Abstract:
Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic
Submitted 10 March, 2025;
originally announced March 2025.