-
Modeling Latent Partner Strategies for Adaptive Zero-Shot Human-Agent Collaboration
Authors:
Benjamin Li,
Shuyang Shi,
Lucia Romero,
Huao Li,
Yaqi Xie,
Woojun Kim,
Stefanos Nikolaidis,
Michael Lewis,
Katia Sycara,
Simon Stepputtis
Abstract:
In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In thi…
▽ More
In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In this work, we introduce TALENTS, a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a range of partner strategies, enabling ad-hoc teamwork. Our approach utilizes a variational autoencoder to learn a latent strategy space from trajectory data. This latent space represents the underlying strategies that agents employ. Subsequently, the system identifies different types of strategy by clustering the data. Finally, a cooperator agent is trained to generate partners for each type of strategy, conditioned on these clusters. In order to adapt to previously unseen partners, we leverage a fixed-share regret minimization algorithm that infers and adjusts the estimated partner strategy dynamically. We assess our approach in a customized version of the Overcooked environment, posing a challenging cooperative cooking task that demands strong coordination across a wide range of possible strategies. Using an online user study, we show that our agent outperforms current baselines when working with unfamiliar human partners.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
Authors:
Guiyu Zhang,
Chen Shi,
Zijian Jiang,
Xunzhi Xiang,
Jingjing Qian,
Shaoshuai Shi,
Li Jiang
Abstract:
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we i…
▽ More
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
△ Less
Submitted 30 June, 2025;
originally announced June 2025.
-
Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
Authors:
Zhenhua Ning,
Zhuotao Tian,
Shaoshuai Shi,
Guangming Lu,
Daojing He,
Wenjie Pei,
Li Jiang
Abstract:
Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position fo…
▽ More
Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction
Authors:
Muleilan Pei,
Shaoshuai Shi,
Lu Zhang,
Peiliang Li,
Shaojie Shen
Abstract:
Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context r…
▽ More
Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Authors:
MiniMax,
:,
Aili Chen,
Aonian Li,
Bangwei Gong,
Binyang Jiang,
Bo Fei,
Bo Yang,
Boji Shan,
Changqing Yu,
Chao Wang,
Cheng Zhu,
Chengjun Xiao,
Chengyu Du,
Chi Zhang,
Chu Qiao,
Chunhao Zhang,
Chunhui Du,
Congchao Guo,
Da Chen,
Deming Ding,
Dianjun Sun,
Dong Li,
Enwei Jiao,
Haigang Zhou
, et al. (103 additional authors not shown)
Abstract:
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…
▽ More
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Edit360: 2D Image Edits to 3D Assets from Any Angle
Authors:
Junchao Huang,
Xinting Hu,
Shaoshuai Shi,
Zhuotao Tian,
Li Jiang
Abstract:
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tun…
▽ More
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
△ Less
Submitted 30 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
FicGCN: Unveiling the Homomorphic Encryption Efficiency from Irregular Graph Convolutional Networks
Authors:
Zhaoxuan Kan,
Husheng Han,
Shangyi Shi,
Tenghui Hua,
Hang Lu,
Xiaowei Li,
Jianan Mu,
Xing Hu
Abstract:
Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing compu…
▽ More
Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing computations to be performed on encrypted data. However, HE introduces substantial computational overhead, particularly for GCN operations that require rotations and multiplications in matrix products. The sparsity of GCNs offers significant performance potential, but their irregularity introduces additional operations that reduce practical gains. In this paper, we propose FicGCN, a HE-based framework specifically designed to harness the sparse characteristics of GCNs and strike a globally optimal balance between aggregation and combination operations. FicGCN employs a latency-aware packing scheme, a Sparse Intra-Ciphertext Aggregation (SpIntra-CA) method to minimize rotation overhead, and a region-based data reordering driven by local adjacency structure. We evaluated FicGCN on several popular datasets, and the results show that FicGCN achieved the best performance across all tested datasets, with up to a 4.10x improvement over the latest design.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
TrajFlow: Multi-modal Motion Prediction via Flow Matching
Authors:
Qi Yan,
Brian Zhang,
Yutong Zhang,
Daniel Yang,
Joshua White,
Di Chen,
Jiachao Liu,
Langechuan Liu,
Binnan Zhuang,
Shaoshuai Shi,
Renjie Liao
Abstract:
Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction method…
▽ More
Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.
△ Less
Submitted 5 July, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding
Authors:
Nianbo Zeng,
Haowen Hou,
Fei Richard Yu,
Si Shi,
Ying Tiffany He
Abstract:
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundarie…
▽ More
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
Authors:
Shijun Shi,
Jing Xu,
Lijing Lu,
Zhihang Li,
Kai Hu
Abstract:
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across ad…
▽ More
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
Authors:
Chen Shi,
Shaoshuai Shi,
Kehua Sheng,
Bo Zhang,
Li Jiang
Abstract:
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX i…
▽ More
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision-3D point cloud forecasting, 2D semantic representation, and image generation-to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection
Authors:
Binh Nguyen,
Shuji Shi,
Ryan Ofman,
Thai Le
Abstract:
Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the…
▽ More
Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably one commercial detection accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model-level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real-world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti-spoofing systems. All source code will be publicly available.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
Authors:
Xuesong Chen,
Linjiang Huang,
Tao Ma,
Rongyao Fang,
Shaoshuai Shi,
Hongsheng Li
Abstract:
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that syn…
▽ More
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
Distributed quantum computing with black-box subroutines
Authors:
X. Xu,
Y. -D. Liu,
S. Shi,
Y. -J. Wang,
D. -S. Wang
Abstract:
In this work, we propose a general protocol for distributed quantum computing that accommodates arbitrary unknown subroutines. It can be applied to scale up quantum computing through multi-chip interconnection, as well as to tasks such as estimating unknown parameters or processes for circuit depth reduction and constructing secure quantum cryptographic protocols. Our protocol builds upon a few te…
▽ More
In this work, we propose a general protocol for distributed quantum computing that accommodates arbitrary unknown subroutines. It can be applied to scale up quantum computing through multi-chip interconnection, as well as to tasks such as estimating unknown parameters or processes for circuit depth reduction and constructing secure quantum cryptographic protocols. Our protocol builds upon a few techniques we develop, such as the oblivious quantum teleportation and control, which can circumvent quantum no-go theorems on the manipulation of unknown objects. Furthermore, we demonstrate that this protocol can be physically implemented using currently available quantum computing platforms. These results suggest that our framework could provide a foundation for developing more advanced quantum algorithms and protocols in the future.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
Occupancy World Model for Robots
Authors:
Zhang Zhang,
Qiang Zhang,
Wei Cui,
Shuai Shi,
Yijie Guo,
Gang Han,
Wen Zhao,
Jingkai Sun,
Jiahang Cao,
Jiaxu Wang,
Hao Cheng,
Xiaozhu Ju,
Zhengping Che,
Renjing Xu,
Jian Tang
Abstract:
Understanding and forecasting the scene evolutions deeply affect the exploration and decision of embodied agents. While traditional methods simulate scene evolutions through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods cluster on the outdoor structure…
▽ More
Understanding and forecasting the scene evolutions deeply affect the exploration and decision of embodied agents. While traditional methods simulate scene evolutions through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods cluster on the outdoor structured road scenes, while ignoring the exploration of forecasting 3D occupancy scene evolutions for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolutions of observed fine-grained occupancy and propose an occupancy world model based on the combined spatio-temporal receptive field and guided autoregressive transformer to forecast the scene evolutions, called RoboOccWorld. We propose the Conditional Causal State Attention (CCSA), which utilizes camera poses of next state as conditions to guide the autoregressive transformer to adapt and understand the indoor robotics scenarios. In order to effectively exploit the spatio-temporal cues from historical observations, Hybrid Spatio-Temporal Aggregation (HSTA) is proposed to obtain the combined spatio-temporal receptive field based on multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that our RoboOccWorld outperforms state-of-the-art methods in indoor 3D occupancy scene evolution prediction task. The code will be released soon.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
Authors:
Enbo Zhao,
Yi Shen,
Shuming Shi,
Jieyun Huang,
Zhihao Chen,
Ning Wang,
Siqi Xiao,
Jian Zhang,
Kai Wang,
Shiguo Lian
Abstract:
Recently, there is a high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used…
▽ More
Recently, there is a high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear what the performance of DeepSeek-R1 and V3 will be after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization maintains little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms traditional Q3_K_M variant on various benchmarks, which is also comparable with 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3\_K\_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
△ Less
Submitted 13 June, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline
Authors:
Hui Zhou,
Shaoshuai Shi,
Hongsheng Li
Abstract:
Machine learning (ML)-based planners have recently gained significant attention. They offer advantages over traditional optimization-based planning algorithms. These advantages include fewer manually selected parameters and faster development. Within ML-based planning, imitation learning (IL) is a common algorithm. It primarily learns driving policies directly from supervised trajectory data. Whil…
▽ More
Machine learning (ML)-based planners have recently gained significant attention. They offer advantages over traditional optimization-based planning algorithms. These advantages include fewer manually selected parameters and faster development. Within ML-based planning, imitation learning (IL) is a common algorithm. It primarily learns driving policies directly from supervised trajectory data. While IL has demonstrated strong performance on many open-loop benchmarks, it remains challenging to determine if the learned policy truly understands fundamental driving principles, rather than simply extrapolating from the ego-vehicle's initial state. Several studies have identified this limitation and proposed algorithms to address it. However, these methods often use original datasets for evaluation. In these datasets, future trajectories are heavily dependent on initial conditions. Furthermore, IL often overfits to the most common scenarios. It struggles to generalize to rare or unseen situations.
To address these challenges, this work proposes: 1) a novel closed-loop simulator supporting both imitation and reinforcement learning, 2) a causal benchmark derived from the Waymo Open Dataset to rigorously assess the impact of the copycat problem, and 3) a novel framework integrating imitation learning and reinforcement learning to overcome the limitations of purely imitative approaches. The code for this work will be released soon.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots
Authors:
Zhang Zhang,
Qiang Zhang,
Wei Cui,
Shuai Shi,
Yijie Guo,
Gang Han,
Wen Zhao,
Hengle Ren,
Renjing Xu,
Jian Tang
Abstract:
3D occupancy prediction enables the robots to obtain spatial fine-grained geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits the network's estimation of complex environments and also limits t…
▽ More
3D occupancy prediction enables the robots to obtain spatial fine-grained geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits the network's estimation of complex environments and also limits the description of the scene by 3D Gaussians. In this paper, we propose a 3D occupancy prediction method which enhances the geometric and semantic scene understanding for robots, dubbed RoboOcc. It utilizes the Opacity-guided Self-Encoder (OSE) to alleviate the semantic ambiguity of overlapping Gaussians and the Geometry-aware Cross-Encoder (GCE) to accomplish the fine-grained geometric modeling of the surrounding scene. We conduct extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet datasets, and our RoboOcc achieves state-of the-art performance in both local and global camera settings. Further, in ablation studies of Gaussian parameters, the proposed RoboOcc outperforms the state-of-the-art methods by a large margin of (8.47, 6.27) in IoU and mIoU metric, respectively. The codes will be released soon.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining
Authors:
Yingbin Liang,
Lu Dai,
Shuo Shi,
Minghao Dai,
Junliang Du,
Haige Wang
Abstract:
Complex data mining has wide application value in many fields, especially in the feature extraction and classification tasks of unlabeled data. This paper proposes an algorithm based on self-supervised learning and verifies its effectiveness through experiments. The study found that in terms of the selection of optimizer and learning rate, the combination of AdamW optimizer and 0.002 learning rate…
▽ More
Complex data mining has wide application value in many fields, especially in the feature extraction and classification tasks of unlabeled data. This paper proposes an algorithm based on self-supervised learning and verifies its effectiveness through experiments. The study found that in terms of the selection of optimizer and learning rate, the combination of AdamW optimizer and 0.002 learning rate performed best in all evaluation indicators, indicating that the adaptive optimization method can improve the performance of the model in complex data mining tasks. In addition, the ablation experiment further analyzed the contribution of each module. The results show that contrastive learning, variational modules, and data augmentation strategies play a key role in the generalization ability and robustness of the model. Through the convergence curve analysis of the loss function, the experiment verifies that the method can converge stably during the training process and effectively avoid serious overfitting. Further experimental results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy. At the same time, under different data distribution conditions, the method can still maintain high detection accuracy, proving its applicability in complex data environments. This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments and verified its advantages in improving feature extraction quality, optimizing classification performance, and enhancing model stability
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
OmniCam: Unified Multimodal Video Generation via Camera Control
Authors:
Xiaoda Yang,
Jiayang Xu,
Kaixuan Luan,
Xinyu Zhan,
Hongshun Qiu,
Shijun Shi,
Hao Li,
Shuai Yang,
Li Zhang,
Checheng Yu,
Cewu Lu,
Lixin Yang
Abstract:
Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generat…
▽ More
Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Command A: An Enterprise-Ready Large Language Model
Authors:
Team Cohere,
:,
Aakanksha,
Arash Ahmadian,
Marwan Ahmed,
Jay Alammar,
Milad Alizadeh,
Yazeed Alnumay,
Sophia Althammer,
Arkady Arkhangorodsky,
Viraat Aryabumi,
Dennis Aumiller,
Raphaël Avalos,
Zahara Aviv,
Sammie Bae,
Saurabh Baji,
Alexandre Barbet,
Max Bartolo,
Björn Bebensee,
Neeral Beladia,
Walter Beller-Morales,
Alexandre Bérard,
Andrew Berneshawi,
Anna Bialas,
Phil Blunsom
, et al. (205 additional authors not shown)
Abstract:
In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Genera…
▽ More
In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.
△ Less
Submitted 14 April, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
FlexMem: High-Parallel Near-Memory Architecture for Flexible Dataflow in Fully Homomorphic Encryption
Authors:
Shangyi Shi,
Husheng Han,
Jianan Mu,
Xinyao Zheng,
Ling Liang,
Hang Lu,
Zidong Du,
Xiaowei Li,
Xing Hu,
Qi Guo
Abstract:
Fully Homomorphic Encryption (FHE) imposes substantial memory bandwidth demands, presenting significant challenges for efficient hardware acceleration. Near-memory Processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelera…
▽ More
Fully Homomorphic Encryption (FHE) imposes substantial memory bandwidth demands, presenting significant challenges for efficient hardware acceleration. Near-memory Processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelerators, which fail to fully utilize the available near-memory bandwidth. In this work, we propose FlexMem, a near-memory accelerator featuring high-parallel computational units with varying memory access strides and interconnect topologies to effectively handle irregular memory access patterns. Furthermore, we design polynomial and ciphertext-level dataflows to efficiently utilize near-memory bandwidth under varying degrees of polynomial parallelism and enhance parallel performance. Experimental results demonstrate that FlexMem achieves 1.12 times of performance improvement over state-of-the-art near-memory architectures, with 95.7% of near-memory bandwidth utilization.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.
-
Benchmarking Machine Learning Methods for Distributed Acoustic Sensing
Authors:
Shuaikai Shi,
Qijun Zong
Abstract:
Distributed acoustic sensing (DAS) technology represents an innovative fiber-optic-based sensing methodology that enables real-time acoustic signal monitoring through the detection of minute perturbations along optical fibers. This sensing approach offers compelling advantages, including extensive measurement ranges, exceptional spatial resolution, and an expansive dynamic measurement spectrum.…
▽ More
Distributed acoustic sensing (DAS) technology represents an innovative fiber-optic-based sensing methodology that enables real-time acoustic signal monitoring through the detection of minute perturbations along optical fibers. This sensing approach offers compelling advantages, including extensive measurement ranges, exceptional spatial resolution, and an expansive dynamic measurement spectrum.
The integration of machine learning (ML) paradigms presents transformative potential for DAS technology, encompassing critical domains such as data augmentation, sophisticated preprocessing techniques, and advanced acoustic event classification and recognition. By leveraging ML algorithms, DAS systems can transition from traditional data processing methodologies to more automated and intelligent analytical frameworks.
The computational intelligence afforded by ML-enhanced DAS technologies facilitates unprecedented monitoring capabilities across diverse critical infrastructure sectors. Particularly noteworthy are the technology's applications in transportation infrastructure, energy management systems, and Natural disaster monitoring frameworks, where the precision of data acquisition and the reliability of intelligent decision-making mechanisms are paramount.
This research critically examines the comparative performance characteristics of classical machine learning methodologies and state-of-the-art deep learning models in the context of DAS data recognition and interpretation, offering comprehensive insights into the evolving landscape of intelligent sensing technologies.
△ Less
Submitted 26 March, 2025;
originally announced March 2025.
-
A Universal Model Combining Differential Equations and Neural Networks for Ball Trajectory Prediction
Authors:
Zhiwei Shi,
Chengxi Zhu,
Fan Yang,
Jun Yan,
Zheyun Qin,
Songquan Shi,
Zhumin Chen
Abstract:
This paper presents a data driven universal ball trajectory prediction method integrated with physics equations. Existing methods are designed for specific ball types and struggle to generalize. This challenge arises from three key factors. First, learning-based models require large datasets but suffer from accuracy drops in unseen scenarios. Second, physics-based models rely on complex formulas a…
▽ More
This paper presents a data driven universal ball trajectory prediction method integrated with physics equations. Existing methods are designed for specific ball types and struggle to generalize. This challenge arises from three key factors. First, learning-based models require large datasets but suffer from accuracy drops in unseen scenarios. Second, physics-based models rely on complex formulas and detailed inputs, yet accurately obtaining ball states, such as spin, is often impractical. Third, integrating physical principles with neural networks to achieve high accuracy, fast inference, and strong generalization remains difficult. To address these issues, we propose an innovative approach that incorporates physics-based equations and neural networks. We first derive three generalized physical formulas. Then, using a neural network and observed trajectory points, we infer certain parameters while fitting the remaining ones. These formulas enable precise trajectory prediction with minimal training data: only a few dozen samples. Extensive experiments demonstrate our method superiority in generalization, real-time performance, and accuracy.
△ Less
Submitted 25 March, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
Authors:
Xuesong Chen,
Shaoshuai Shi,
Tao Ma,
Jingqiu Zhou,
Simon See,
Ka Chun Cheung,
Hongsheng Li
Abstract:
The perception system for autonomous driving generally requires to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this…
▽ More
The perception system for autonomous driving generally requires to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance than single task model. M3Net takes multimodal data as input and multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based decoder and Mamba-based decoder, demonstrating its flexibility to different architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas
Authors:
Zihao Guo,
Shuqing Shi,
Richard Willis,
Tristan Tomilin,
Joel Z. Leibo,
Yali Du
Abstract:
Sequential social dilemmas pose a significant challenge in the field of multi-agent reinforcement learning (MARL), requiring environments that accurately reflect the tension between individual and collective interests. Previous benchmarks and environments, such as Melting Pot, provide an evaluation protocol that measures generalization to new social partners in various test scenarios. However, run…
▽ More
Sequential social dilemmas pose a significant challenge in the field of multi-agent reinforcement learning (MARL), requiring environments that accurately reflect the tension between individual and collective interests. Previous benchmarks and environments, such as Melting Pot, provide an evaluation protocol that measures generalization to new social partners in various test scenarios. However, running reinforcement learning algorithms in traditional environments requires substantial computational resources. In this paper, we introduce SocialJax, a suite of sequential social dilemma environments and algorithms implemented in JAX. JAX is a high-performance numerical computing library for Python that enables significant improvements in operational efficiency. Our experiments demonstrate that the SocialJax training pipeline achieves at least 50\texttimes{} speed-up in real-time performance compared to Melting Pot RLlib baselines. Additionally, we validate the effectiveness of baseline algorithms within SocialJax environments. Finally, we use Schelling diagrams to verify the social dilemma properties of these environments, ensuring that they accurately capture the dynamics of social dilemmas.
△ Less
Submitted 19 May, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Authors:
Runjian Chen,
Wenqi Shao,
Bo Zhang,
Shaoshuai Shi,
Li Jiang,
Ping Luo
Abstract:
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators…
▽ More
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) sample efficiency of simulation datasets 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM , shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset NuScenes, we demonstrate that, with SOTA 3D object detector, JiSAM is able to utilize the simulation data and only labels on 2.5% available real data to achieve comparable performance to models trained on all real data. Additionally, JiSAM achieves more than 15 mAPs on the objects not labeled in the real training set. We will release models and codes.
△ Less
Submitted 13 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Advancing Problem-Based Learning with Clinical Reasoning for Improved Differential Diagnosis in Medical Education
Authors:
Yuansong Xu,
Yuheng Shao,
Jiahe Dong,
Shaohan Shi,
Chang Jiang,
Quan Li
Abstract:
Medical education increasingly emphasizes students' ability to apply knowledge in real-world clinical settings, focusing on evidence-based clinical reasoning and differential diagnoses. Problem-based learning (PBL) addresses traditional teaching limitations by embedding learning into meaningful contexts and promoting active participation. However, current PBL practices are often confined to medica…
▽ More
Medical education increasingly emphasizes students' ability to apply knowledge in real-world clinical settings, focusing on evidence-based clinical reasoning and differential diagnoses. Problem-based learning (PBL) addresses traditional teaching limitations by embedding learning into meaningful contexts and promoting active participation. However, current PBL practices are often confined to medical instructional settings, limiting students' ability to self-direct and refine their approaches based on targeted improvements. Additionally, the unstructured nature of information organization during analysis poses challenges for record-keeping and subsequent review. Existing research enhances PBL realism and immersion but overlooks the construction of logic chains and evidence-based reasoning. To address these gaps, we designed e-MedLearn, a learner-centered PBL system that supports more efficient application and practice of evidence-based clinical reasoning. Through controlled study (N=19) and testing interviews (N=13), we gathered data to assess the system's impact. The findings demonstrate that e-MedLearn improves PBL experiences and provides valuable insights for advancing clinical reasoning-based learning.
△ Less
Submitted 8 March, 2025;
originally announced March 2025.
-
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
Authors:
Yi Shen,
Jian Zhang,
Jieyun Huang,
Shuming Shi,
Wenjing Zhang,
Jiangze Yan,
Ning Wang,
Kai Wang,
Zhaoxiang Liu,
Shiguo Lian
Abstract:
Recent advancements in slow thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking (generating redundant reasoning steps for simple problems), leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks…
▽ More
Recent advancements in slow thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking (generating redundant reasoning steps for simple problems), leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought (CoT) based on problem difficulty. We first propose a Token Length Budget (TLB) metric to quantify difficulty, then leverage budget-aware reward shaping and budget preference optimization to implement DAST. DAST penalizes overlong responses for simple tasks while incentivizing sufficient reasoning for complex problems. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking (reducing token usage by over 30\% on average) while preserving reasoning accuracy on complex problems. Our codes and models are available at https://github.com/AnonymousUser0520/AnonymousRepo01.
△ Less
Submitted 3 June, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Authors:
Shuming Shi,
Ruobing Zuo,
Gaolei He,
Jianlin Wang,
Chenyang Xu,
Zhengfeng Yang
Abstract:
Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for…
▽ More
Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for certain tasks. However, the lack of high-quality preference data for theorem proving presents a significant challenge. In this paper, we innovatively apply DPO to formal automated theorem proving and introduces a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. Specifically, we propose a method for constructing preference data which utilizes LLMs and existing theorem proving data to enhance the diversity of the preference data while reducing the reliance on human preference annotations. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model through DPO. Experimental results on the MiniF2F and ProofNet datasets demonstrate the effectiveness of the proposed method.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
Authors:
Zhenheng Tang,
Zichen Tang,
Junlin Huang,
Xinglin Pan,
Rudan Yan,
Yuxin Wang,
Amelie Chi Zhou,
Shaohuai Shi,
Xiaowen Chu,
Bo Li
Abstract:
The growth of large language models (LLMs) increases challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with stochastic gradient descent (S-SGD) is the main bottleneck in low-ba…
▽ More
The growth of large language models (LLMs) increases challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Local SGD mitigates communication overhead by reducing synchronization frequency, and recent studies have successfully applied it to geo-distributedly pre-train LLMs. However, we identify that its model synchronization mechanism prevents overlapping communication and computation, which makes the system lose opportunities to overlap communication and computation.
To overcome this limitation, we expand the design space of local SGD by layer-wisely decoupling model synchronization. In each iteration, only some layers are synchronized instead of the entire model after a specific number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework to accelerate low-bandwidth distributed training with three key innovations: (1) partial local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule the communication and computation to reduce the training time based on fine-grained profiling of layer-wise communication and computation time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\times$ to $3.91\times$ over leading baseline methods.
△ Less
Submitted 16 February, 2025;
originally announced February 2025.
-
Outback: Fast and Communication-efficient Index for Key-Value Store on Disaggregated Memory
Authors:
Yi Liu,
Minghao Xie,
Shouqian Shi,
Yuanchao Xu,
Heiner Litz,
Chen Qian
Abstract:
Disaggregated memory systems achieve resource utilization efficiency and system scalability by distributing computation and memory resources into distinct pools of nodes. RDMA is an attractive solution to support high-throughput communication between different disaggregated resource pools. However, existing RDMA solutions face a dilemma: one-sided RDMA completely bypasses computation at memory nod…
▽ More
Disaggregated memory systems achieve resource utilization efficiency and system scalability by distributing computation and memory resources into distinct pools of nodes. RDMA is an attractive solution to support high-throughput communication between different disaggregated resource pools. However, existing RDMA solutions face a dilemma: one-sided RDMA completely bypasses computation at memory nodes, but its communication takes multiple round trips; two-sided RDMA achieves one-round-trip communication but requires non-trivial computation for index lookups at memory nodes, which violates the principle of disaggregated memory. This work presents Outback, a novel indexing solution for key-value stores with a one-round-trip RDMA-based network that does not incur computation-heavy tasks at memory nodes. Outback is the first to utilize dynamic minimal perfect hashing and separates its index into two components: one memory-efficient and compute-heavy component at compute nodes and the other memory-heavy and compute-efficient component at memory nodes. We implement a prototype of Outback and evaluate its performance in a public cloud. The experimental results show that Outback achieves higher throughput than both the state-of-the-art one-sided RDMA and two-sided RDMA-based in-memory KVS by 1.06-5.03x, due to the unique strength of applying a separated perfect hashing index.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
StarCast: A Secure and Spectrum-Efficient Group Communication Scheme for LEO Satellite Networks
Authors:
Chaoyu Zhang,
Hexuan Yu,
Shanghao Shi,
Shaoyu Li,
Yi Shi,
Eric Burger,
Y. Thomas Hou,
Wenjing Lou
Abstract:
Low Earth Orbit (LEO) satellite networks serve as a cornerstone infrastructure for providing ubiquitous connectivity in areas where terrestrial infrastructure is unavailable. With the emergence of Direct-to-Cell (DTC) satellites, these networks can provide direct access to mobile phones and IoT devices without relying on terrestrial base stations, leading to a surge in massive connectivity demands…
▽ More
Low Earth Orbit (LEO) satellite networks serve as a cornerstone infrastructure for providing ubiquitous connectivity in areas where terrestrial infrastructure is unavailable. With the emergence of Direct-to-Cell (DTC) satellites, these networks can provide direct access to mobile phones and IoT devices without relying on terrestrial base stations, leading to a surge in massive connectivity demands for the serving satellite. To address this issue, group communication is an effective paradigm that enables simultaneous content delivery to multiple users and thus optimizes bandwidth reuse. Although extensive research has been conducted to improve group communication performance, securing this communication without compromising its inherent spectrum efficiency remains a critical challenge. To address this, we introduce StarCast, a secure group encryption scheme for LEO satellite networks. Our solution leverages ciphertext-policy attribute-based encryption (CP-ABE) to implement fine-grained access control by embedding access policies directly within the ciphertext. Unlike standard secure communication approaches that require dedicated per-user channels and significantly deplete limited satellite spectrum resources, StarCast maintains efficient spectrum reuse within user groups while ensuring that only authorized users can access transmitted data. Additionally, it significantly reduces the costly key management overhead associated with conventional encryption schemes.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning
Authors:
Muhan Lin,
Shuyang Shi,
Yue Guo,
Vaishnav Tadiparthi,
Behdad Chalaki,
Ehsan Moradi Pari,
Simon Stepputtis,
Woojun Kim,
Joseph Campbell,
Katia Sycara
Abstract:
Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi-agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly-used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward…
▽ More
Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi-agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly-used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward functions that align with human intuition can be complex and labor-intensive. In this work, we propose a novel framework where a large language model (LLM) generates dense, agent-specific rewards based on a natural language description of the task and the overall team goal. By learning a potential-based reward function over multiple queries, our method reduces the impact of ranking errors while allowing the LLM to evaluate each agent's contribution to the overall task. Through extensive experiments, we demonstrate that our approach achieves faster convergence and higher policy returns compared to state-of-the-art MARL baselines.
△ Less
Submitted 28 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
When Everyday Devices Become Weapons: A Closer Look at the Pager and Walkie-talkie Attacks
Authors:
Pantha Protim Sarker,
Upoma Das,
Nitin Varshney,
Shang Shi,
Akshay Kulkarni,
Farimah Farahmandi,
Mark Tehranipoor
Abstract:
Battery-powered technologies like pagers and walkie-talkies have long been integral to civilian and military operations. However, the potential for such everyday devices to be weaponized has largely been underestimated in the realm of cybersecurity. In September 2024, Lebanon experienced a series of unprecedented, coordinated explosions triggered through compromised pagers and walkie-talkies, crea…
▽ More
Battery-powered technologies like pagers and walkie-talkies have long been integral to civilian and military operations. However, the potential for such everyday devices to be weaponized has largely been underestimated in the realm of cybersecurity. In September 2024, Lebanon experienced a series of unprecedented, coordinated explosions triggered through compromised pagers and walkie-talkies, creating a new category of attack in the domain of cyber-physical warfare. This attack not only disrupted critical communication networks but also resulted in injuries, loss of life, and exposed significant national security vulnerabilities, prompting governments and organizations worldwide to reevaluate their cybersecurity frameworks. This article provides an in-depth investigation into the infamous Pager and Walkie-Talkie attacks, analyzing both technical and non-technical dimensions. Furthermore, the study extends its scope to explore vulnerabilities in other battery-powered infrastructures, such as battery management systems, highlighting their potential exploitation. Existing prevention and detection techniques are reviewed, with an emphasis on their limitations and the challenges they face in addressing emerging threats. Finally, the article discusses emerging methodologies, particularly focusing on the role of physical inspection, as a critical component of future security measures. This research aims to provide actionable insights to bolster the resilience of cyber-physical systems in an increasingly interconnected world.
△ Less
Submitted 28 January, 2025;
originally announced January 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Authors:
Zhenran Xu,
Longyue Wang,
Jifang Wang,
Zhouyi Li,
Senbao Shi,
Xue Yang,
Yiyu Wang,
Baotian Hu,
Jun Yu,
Min Zhang
Abstract:
Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D vir…
▽ More
Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Authors:
Xinglin Pan,
Wenxiang Lin,
Lin Zhang,
Shaohuai Shi,
Zhenheng Tang,
Rui Wang,
Bo Li,
Xiaowen Chu
Abstract:
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexi…
▽ More
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
△ Less
Submitted 18 January, 2025;
originally announced January 2025.
-
MiniMax-01: Scaling Foundation Models with Lightning Attention
Authors:
MiniMax,
Aonian Li,
Bangwei Gong,
Bo Yang,
Boji Shan,
Chang Liu,
Cheng Zhu,
Chunhao Zhang,
Congchao Guo,
Da Chen,
Dong Li,
Enwei Jiao,
Gengxin Li,
Guojun Zhang,
Haohai Sun,
Houze Dong,
Jiadai Zhu,
Jiaqi Zhuang,
Jiayuan Song,
Jin Zhu,
Jingtao Han,
Jingyang Li,
Junbin Xie,
Junhao Xu,
Junjie Yan
, et al. (65 additional authors not shown)
Abstract:
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, o…
▽ More
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
Physics-Driven Learning for Inverse Problems in Quantum Chromodynamics
Authors:
Gert Aarts,
Kenji Fukushima,
Tetsuo Hatsuda,
Andreas Ipp,
Shuzhe Shi,
Lingxiao Wang,
Kai Zhou
Abstract:
The integration of deep learning techniques and physics-driven designs is reforming the way we address inverse problems, in which accurate physical properties are extracted from complex data sets. This is particularly relevant for quantum chromodynamics (QCD), the theory of strong interactions, with its inherent limitations in observational data and demanding computational approaches. This perspec…
▽ More
The integration of deep learning techniques and physics-driven designs is reforming the way we address inverse problems, in which accurate physical properties are extracted from complex data sets. This is particularly relevant for quantum chromodynamics (QCD), the theory of strong interactions, with its inherent limitations in observational data and demanding computational approaches. This perspective highlights advances and potential of physics-driven learning methods, focusing on predictions of physical quantities towards QCD physics, and drawing connections to machine learning(ML). It is shown that the fusion of ML and physics can lead to more efficient and reliable problem-solving strategies. Key ideas of ML, methodology of embedding physics priors, and generative models as inverse modelling of physical probability distributions are introduced. Specific applications cover first-principle lattice calculations, and QCD physics of hadrons, neutron stars, and heavy-ion collisions. These examples provide a structured and concise overview of how incorporating prior knowledge such as symmetry, continuity and equations into deep learning designs can address diverse inverse problems across different physical sciences.
△ Less
Submitted 9 January, 2025;
originally announced January 2025.
-
Cosmos World Foundation Model Platform for Physical AI
Authors:
NVIDIA,
:,
Niket Agarwal,
Arslan Ali,
Maciej Bala,
Yogesh Balaji,
Erik Barker,
Tiffany Cai,
Prithvijit Chattopadhyay,
Yongxin Chen,
Yin Cui,
Yifan Ding,
Daniel Dworakowski,
Jiaojiao Fan,
Michele Fenzi,
Francesco Ferroni,
Sanja Fidler,
Dieter Fox,
Songwei Ge,
Yunhao Ge,
Jinwei Gu,
Siddharth Gururani,
Ethan He,
Jiahui Huang,
Jacob Huffman
, et al. (54 additional authors not shown)
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into cu…
▽ More
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
△ Less
Submitted 18 March, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
A Conditional Diffusion Model for Electrical Impedance Tomography Image Reconstruction
Authors:
Shuaikai Shi,
Ruiyuan Kang,
Panos Liatsis
Abstract:
Electrical impedance tomography (EIT) is a non-invasive imaging technique, capable of reconstructing images of the electrical conductivity of tissues and materials. It is popular in diverse application areas, from medical imaging to industrial process monitoring and tactile sensing, due to its low cost, real-time capabilities and non-ionizing nature. EIT visualizes the conductivity distribution wi…
▽ More
Electrical impedance tomography (EIT) is a non-invasive imaging technique, capable of reconstructing images of the electrical conductivity of tissues and materials. It is popular in diverse application areas, from medical imaging to industrial process monitoring and tactile sensing, due to its low cost, real-time capabilities and non-ionizing nature. EIT visualizes the conductivity distribution within a body by measuring the boundary voltages, given a current injection. However, EIT image reconstruction is ill-posed due to the mismatch between the under-sampled voltage data and the high-resolution conductivity image. A variety of approaches, both conventional and deep learning-based, have been proposed, capitalizing on the use of spatial regularizers, and the paradigm of image regression. In this research, a novel method based on the conditional diffusion model for EIT reconstruction is proposed, termed CDEIT. Specifically, CDEIT consists of the forward diffusion process, which first gradually adds Gaussian noise to the clean conductivity images, and a reverse denoising process, which learns to predict the original conductivity image from its noisy version, conditioned on the boundary voltages. Following model training, CDEIT applies the conditional reverse process on test voltage data to generate the desired conductivities. Moreover, we provide the details of a normalization procedure, which demonstrates how EIT image reconstruction models trained on simulated datasets can be applied on real datasets with varying sizes, excitation currents and background conductivities. Experiments conducted on a synthetic dataset and two real datasets demonstrate that the proposed model outperforms state-of-the-art methods. The CDEIT software is available as open-source (https://github.com/shuaikaishi/CDEIT) for reproducibility purposes.
△ Less
Submitted 22 December, 2024;
originally announced December 2024.
-
SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
Authors:
Zhentao Tan,
Ben Xue,
Jian Jia,
Junhao Wang,
Wencai Ye,
Shaoyun Shi,
Mingjie Sun,
Wenjin Wu,
Quan Chen,
Peng Jiang
Abstract:
This paper presents the \textbf{S}emantic-a\textbf{W}ar\textbf{E} spatial-t\textbf{E}mporal \textbf{T}okenizer (SweetTok), a novel video tokenizer to overcome the limitations in current video tokenization methods for compacted yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok propo…
▽ More
This paper presents the \textbf{S}emantic-a\textbf{W}ar\textbf{E} spatial-t\textbf{E}mporal \textbf{T}okenizer (SweetTok), a novel video tokenizer to overcome the limitations in current video tokenization methods for compacted yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via \textbf{D}ecoupled \textbf{Q}uery \textbf{A}uto\textbf{E}ncoder (DQAE). This design allows SweetTok to efficiently compress video token count while achieving superior fidelity by capturing essential information across spatial and temporal dimensions. Furthermore, we design a \textbf{M}otion-enhanced \textbf{L}anguage \textbf{C}odebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by \textbf{42.8\%} w.r.t rFVD on UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by \textbf{15.1\%} w.r.t gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
△ Less
Submitted 10 March, 2025; v1 submitted 11 December, 2024;
originally announced December 2024.
-
A Context-Enhanced Framework for Sequential Graph Reasoning
Authors:
Shuo Shi,
Chao Peng,
Chenyang Xu,
Zhengfeng Yang
Abstract:
The paper studies sequential reasoning over graph-structured data, which stands as a fundamental task in various trending fields like automated math problem solving and neural graph algorithm learning, attracting a lot of research interest. Simultaneously managing both sequential and graph-structured information in such tasks presents a notable challenge. Over recent years, many neural architectur…
▽ More
The paper studies sequential reasoning over graph-structured data, which stands as a fundamental task in various trending fields like automated math problem solving and neural graph algorithm learning, attracting a lot of research interest. Simultaneously managing both sequential and graph-structured information in such tasks presents a notable challenge. Over recent years, many neural architectures in the literature have emerged to tackle the issue. In this work, we generalize the existing architectures and propose a context-enhanced framework. The crucial innovation is that the reasoning of each step does not only rely on the outcome of the preceding step but also leverages the aggregation of information from more historical outcomes. The idea stems from our observation that in sequential graph reasoning, each step's outcome has a much stronger inner connection with each other compared to traditional seq-to-seq tasks. We show that the framework can effectively integrate with the existing methods, enhancing their reasoning abilities. Empirical evaluations are conducted on the challenging CLRS Reasoning Benchmark, and the results demonstrate that the proposed framework significantly improves the performance of existing architectures, yielding state-of-the-art results across the majority of the datasets within the benchmark.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
Authors:
Hongming Guo,
Ruibo Fu,
Yizhong Geng,
Shuai Liu,
Shuchen Shi,
Tao Wang,
Chunyu Qiang,
Chenxing Li,
Ya Li,
Zhengqi Wen,
Yukun Liu,
Xuefei Liu
Abstract:
Text-to-audio (TTA) model is capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel-spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. I…
▽ More
Text-to-audio (TTA) model is capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel-spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we begin by investigating the critical role of U-Net in Mel-spectrogram generation. Our analysis shows that in U-Net structure, high-frequency components in skip-connections and the backbone influence texture and detail, while low-frequency components in the backbone are critical for the diffusion denoising process. We further propose ``Mel-Refine'', a plug-and-play approach that enhances Mel-spectrogram texture and detail by adjusting different component weights during inference. Our method requires no additional training or fine-tuning and is fully compatible with any diffusion-based TTA architecture. Experimental results show that our approach boosts performance metrics of the latest TTA model Tango2 by 25\%, demonstrating its effectiveness.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
Access Point Deployment for Localizing Accuracy and User Rate in Cell-Free Systems
Authors:
Fanfei Xu,
Shengheng Liu,
Zihuan Mao,
Shangqing Shi,
Dazhuan Xu,
Dongming Wang,
Yongming Huang
Abstract:
Evolving next-generation mobile networks is designed to provide ubiquitous coverage and networked sensing. With utility of multi-view sensing and multi-node joint transmission, cell-free is a promising technique to realize this prospect. This paper aims to tackle the problem of access point (AP) deployment in cell-free systems to balance the sensing accuracy and user rate. By merging the D-optimal…
▽ More
Evolving next-generation mobile networks is designed to provide ubiquitous coverage and networked sensing. With utility of multi-view sensing and multi-node joint transmission, cell-free is a promising technique to realize this prospect. This paper aims to tackle the problem of access point (AP) deployment in cell-free systems to balance the sensing accuracy and user rate. By merging the D-optimality with Euclidean criterion, a novel integrated metric is proposed to be the objective function for both max-sum and max-min problems, which respectively guarantee the overall and lowest performance in multi-user communication and target tracking scenario. To solve the corresponding high dimensional non-convex multi-objective problem, the Soft actor-critic (SAC) is utilized to avoid risk of local optimal result. Numerical results demonstrate that proposed SAC-based APs deployment method achieves $20\%$ of overall performance and $120\%$ of lowest performance.
△ Less
Submitted 9 December, 2024;
originally announced December 2024.
-
Adversarial Transferability in Deep Denoising Models: Theoretical Insights and Robustness Enhancement via Out-of-Distribution Typical Set Sampling
Authors:
Jie Ning,
Jiebao Sun,
Shengzhu Shi,
Zhichang Guo,
Yao Li,
Hongwei Li,
Boying Wu
Abstract:
Deep learning-based image denoising models demonstrate remarkable performance, but their lack of robustness analysis remains a significant concern. A major issue is that these models are susceptible to adversarial attacks, where small, carefully crafted perturbations to input data can cause them to fail. Surprisingly, perturbations specifically crafted for one model can easily transfer across vari…
▽ More
Deep learning-based image denoising models demonstrate remarkable performance, but their lack of robustness analysis remains a significant concern. A major issue is that these models are susceptible to adversarial attacks, where small, carefully crafted perturbations to input data can cause them to fail. Surprisingly, perturbations specifically crafted for one model can easily transfer across various models, including CNNs, Transformers, unfolding models, and plug-and-play models, leading to failures in those models as well. Such high adversarial transferability is not observed in classification models. We analyze the possible underlying reasons behind the high adversarial transferability through a series of hypotheses and validation experiments. By characterizing the manifolds of Gaussian noise and adversarial perturbations using the concept of typical set and the asymptotic equipartition property, we prove that adversarial samples deviate slightly from the typical set of the original input distribution, causing the models to fail. Based on these insights, we propose a novel adversarial defense method: the Out-of-Distribution Typical Set Sampling Training strategy (TS). TS not only significantly enhances the model's robustness but also marginally improves denoising performance compared to the original model.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
Authors:
Shuwei Shi,
Biao Gong,
Xi Chen,
Dandan Zheng,
Shuai Tan,
Zizheng Yang,
Yuyuan Li,
Jingwen He,
Kecheng Zheng,
Jingdong Chen,
Ming Yang,
Yinqiang Zheng
Abstract:
The image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal. These motion-aware models are appealing to generate diverse motion patterns, yet there lacks a reliable motion estimator for training such models on large-scale video set in the wild. Traditional metrics, e.g., SSIM or optical flow, are h…
▽ More
The image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal. These motion-aware models are appealing to generate diverse motion patterns, yet there lacks a reliable motion estimator for training such models on large-scale video set in the wild. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, while, it is very tough for human annotators to label the abstract motion intensity neither. Furthermore, the motion intensity shall reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage the contrastive learning on randomly paired videos and distinguish the video with greater motion intensity. Such a paradigm is friendly for annotation and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages warrant the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
Authors:
John Dang,
Shivalika Singh,
Daniel D'souza,
Arash Ahmadian,
Alejandro Salamanca,
Madeline Smith,
Aidan Peppin,
Sungjin Hong,
Manoj Govindassamy,
Terrence Zhao,
Sandra Kublik,
Meor Amer,
Viraat Aryabumi,
Jon Ander Campos,
Yi-Chern Tan,
Tom Kocmi,
Florian Strub,
Nathan Grinsztajn,
Yannis Flet-Berliac,
Acyr Locatelli,
Hangyu Lin,
Dwarak Talupuru,
Bharat Venkitesh,
David Cairuz,
Bowen Yang
, et al. (20 additional authors not shown)
Abstract:
We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual prefere…
▽ More
We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual preference training, and model merging, Aya Expanse sets a new state-of-the-art in multilingual performance. Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models in their respective parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model with twice as many parameters, achieving a 54.0% win-rate. In this short technical report, we present extended evaluation results for the Aya Expanse model family and release their open-weights, together with a new multilingual evaluation dataset m-ArenaHard.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
Mimir: Improving Video Diffusion Models for Precise Text Understanding
Authors:
Shuai Tan,
Biao Gong,
Yutong Feng,
Kecheng Zheng,
Dandan Zheng,
Shuwei Shi,
Yujun Shen,
Jingdong Chen,
Ming Yang
Abstract:
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T…
▽ More
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: https://lucaria-academy.github.io/Mimir/
△ Less
Submitted 4 December, 2024;
originally announced December 2024.