-
Coding for Quasi-Static Fading Channel with Imperfect CSI at the Transmitter and Quantized Feedback
Authors:
Yuhan Yang,
Mei Han,
Haonan Zhang,
Haoheng Yuan,
Fan Cheng,
Bin Dai
Abstract:
The classical Schalkwijk-Kailath (SK) scheme for the additive Gaussian noise channel with noiseless feedback is highly efficient since its coding complexity is extremely low and the decoding error doubly exponentially decays as the coding blocklength tends to infinity. However, its application to the fading channel with imperfect CSI at the transmitter (I-CSIT) is challenging since the SK scheme i…
▽ More
The classical Schalkwijk-Kailath (SK) scheme for the additive Gaussian noise channel with noiseless feedback is highly efficient since its coding complexity is extremely low and the decoding error doubly exponentially decays as the coding blocklength tends to infinity. However, its application to the fading channel with imperfect CSI at the transmitter (I-CSIT) is challenging since the SK scheme is sensitive to the CSI. In this paper, we investigate how to design SK-type scheme for the quasi-static fading channel with I-CSIT and quantized feedback. By introducing modulo lattice function and an auxiliary signal into the SK-type encoder-decoder of the transceiver, we show that the decoding error caused by the I-CSIT can be perfectly eliminated, resulting in the success of designing SK-type scheme for such a case. The study of this paper provides a way to design efficient coding scheme for fading channels in the presence of imperfect CSI and quantized feedback.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Optimal Feedback Schemes for Dirty Paper Channels With State Estimation at the Receiver
Authors:
Dengfeng Xia,
Han Deng,
Haonan Zhang,
Fan Cheng,
Bin Dai,
Liuguo Yin
Abstract:
In the literature, it has been shown that feedback does not increase the optimal rate-distortion region of the dirty paper channel with state estimation at the receiver (SE-R). On the other hand, it is well-known that feedback helps to construct low-complexity coding schemes in Gaussian channels, such as the elegant Schalkwijk-Kailath (SK) feedback scheme. This motivates us to explore capacity-ach…
▽ More
In the literature, it has been shown that feedback does not increase the optimal rate-distortion region of the dirty paper channel with state estimation at the receiver (SE-R). On the other hand, it is well-known that feedback helps to construct low-complexity coding schemes in Gaussian channels, such as the elegant Schalkwijk-Kailath (SK) feedback scheme. This motivates us to explore capacity-achieving SK-type schemes in dirty paper channels with SE-R and feedback. In this paper, we first propose a capacity-achieving feedback scheme for the dirty paper channel with SE-R (DPC-SE-R), which combines the superposition coding and the classical SK-type scheme. Then, we extend this scheme to the dirty paper multiple-access channel with SE-R and feedback, and also show the extended scheme is capacity-achieving. Finally, we discuss how to extend our scheme to a noisy state observation case of the DPC-SE-R. However, the capacity-achieving SK-type scheme for such a case remains unknown.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
Authors:
Tong Yang,
Bo Dai,
Lin Xiao,
Yuejie Chi
Abstract:
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practi…
▽ More
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.
△ Less
Submitted 27 June, 2025;
originally announced June 2025.
-
A Simple but Accurate Approximation for Multivariate Gaussian Rate-Distortion Function and Its Application in Maximal Coding Rate Reduction
Authors:
Zhenglin Huang,
Qifa Yan,
Bin Dai,
Xiaohu Tang
Abstract:
The multivariate Gaussian rate-distortion (RD) function is crucial in various applications, such as digital communications, data storage, or neural networks. However, the complex form of the multivariate Gaussian RD function prevents its application in many neural network-based scenarios that rely on its analytical properties, for example, white-box neural networks, multi-device task-oriented comm…
▽ More
The multivariate Gaussian rate-distortion (RD) function is crucial in various applications, such as digital communications, data storage, or neural networks. However, the complex form of the multivariate Gaussian RD function prevents its application in many neural network-based scenarios that rely on its analytical properties, for example, white-box neural networks, multi-device task-oriented communication, and semantic communication. This paper proposes a simple but accurate approximation for the multivariate Gaussian RD function. The upper and lower bounds on the approximation error (the difference between the approximate and the exact value) are derived, which indicate that for well-conditioned covariance matrices, the approximation error is small. In particular, when the condition number of the covariance matrix approaches 1, the approximation error approaches 0. In addition, based on the proposed approximation, a new classification algorithm called Adaptive Regularized ReduNet (AR-ReduNet) is derived by applying the approximation to ReduNet, which is a white-box classification network oriented from Maximal Coding Rate Reduction (MCR$^2$) principle. Simulation results indicate that AR-ReduNet achieves higher accuracy and more efficient optimization than ReduNet.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
Learning Task-Agnostic Skill Bases to Uncover Motor Primitives in Animal Behaviors
Authors:
Jiyi Wang,
Jingyang Ke,
Bo Dai,
Anqi Wu
Abstract:
Animals flexibly recombine a finite set of core motor primitives to meet diverse task demands, but existing behavior-segmentation methods oversimplify this process by imposing discrete syllables under restrictive generative assumptions. To reflect the animal behavior generation procedure, we introduce skill-based imitation learning (SKIL) for behavior understanding, a reinforcement learning-based…
▽ More
Animals flexibly recombine a finite set of core motor primitives to meet diverse task demands, but existing behavior-segmentation methods oversimplify this process by imposing discrete syllables under restrictive generative assumptions. To reflect the animal behavior generation procedure, we introduce skill-based imitation learning (SKIL) for behavior understanding, a reinforcement learning-based imitation framework that (1) infers interpretable skill sets, i.e., latent basis functions of behavior, by leveraging representation learning on transition probabilities, and (2) parameterizes policies as dynamic mixtures of these skills. We validate our approach on a simple grid world, a discrete labyrinth, and unconstrained videos of freely moving animals. Across tasks, it identifies reusable skill components, learns continuously evolving compositional policies, and generates realistic trajectories beyond the capabilities of traditional discrete models. By exploiting generative behavior modeling with compositional representations, our method offers a concise, principled account of how complex animal behaviors emerge from dynamic combinations of fundamental motor primitives.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Reasoning with Exploration: An Entropy Perspective
Authors:
Daixuan Cheng,
Shaohan Huang,
Xuekai Zhu,
Bo Dai,
Wayne Xin Zhao,
Zhenliang Zhang,
Furu Wei
Abstract:
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analys…
▽ More
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
-
Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Authors:
Xingjian Ran,
Yixuan Li,
Linning Xu,
Mulin Yu,
Bo Dai
Abstract:
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely…
▽ More
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layout that sacrifice flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat
Authors:
Qi Lin,
Weikai Xu,
Lisi Chen,
Bin Dai
Abstract:
With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synt…
▽ More
With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views
Authors:
Lihan Jiang,
Yucheng Mao,
Linning Xu,
Tao Lu,
Kerui Ren,
Yichen Jin,
Xudong Xu,
Mulin Yu,
Jiangmiao Pang,
Feng Zhao,
Dahua Lin,
Bo Dai
Abstract:
We introduce AnySplat, a feed forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per scene optimization, or recent feed forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gauss…
▽ More
We introduce AnySplat, a feed forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per scene optimization, or recent feed forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi view datasets without any pose annotations. In extensive zero shot evaluations, AnySplat matches the quality of pose aware baselines in both sparse and dense view scenarios while surpassing existing pose free approaches. Moreover, it greatly reduce rendering latency compared to optimization based neural fields, bringing real time novel view synthesis within reach for unconstrained capture settings.Project page: https://city-super.github.io/anysplat/
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Vision Transformers with Self-Distilled Registers
Authors:
Yinjie Chen,
Zipeng Yan,
Chong Zhou,
Bo Dai,
Andrew F. Luo
Abstract:
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or…
▽ More
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is to the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim at equipping them with such register tokens without the need of re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data and full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation
Authors:
Kerui Ren,
Jiayang Bai,
Linning Xu,
Lihan Jiang,
Jiangmiao Pang,
Mulin Yu,
Bo Dai
Abstract:
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods…
▽ More
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden its applicability, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, as well as casually captured real-world scenes demonstrate the framework's robustness and wide generalization.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Towards Better Instruction Following Retrieval Models
Authors:
Yuchen Zhuang,
Aaron Trinh,
Rushi Qiang,
Haotian Sun,
Chao Zhang,
Hanjun Dai,
Bo Dai
Abstract:
Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, pass…
▽ More
Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware Embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, measuring the instruction-following capabilities.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Authors:
Ziju Shen,
Naohao Huang,
Fanyi Yang,
Yutong Wang,
Guoxiong Gao,
Tianyi Xu,
Jiedong Jiang,
Wanyi He,
Pu Yang,
Mengzhou Sun,
Haocheng Ju,
Peihao Wu,
Bryan Dai,
Bin Dong
Abstract:
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (…
▽ More
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
△ Less
Submitted 16 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
One-shot Entropy Minimization
Authors:
Zitian Gao,
Lynx Chen,
Joey Zhou,
Bryan Dai
Abstract:
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large…
▽ More
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at https://github.com/zitian-gao/one-shot-em.
△ Less
Submitted 3 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes
Authors:
Changjian Jiang,
Kerui Ren,
Linning Xu,
Jiong Chen,
Jiangmiao Pang,
Yu Zhang,
Bo Dai,
Mulin Yu
Abstract:
High fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photo realistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles fo…
▽ More
High fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photo realistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles for geometry with Gaussian primitives for appearance, motivated by the lightweight classic geometry representations and their proven efficiency in real world applications. Our design yields a compact yet expressive model capable of photo realistic rendering across both indoor and outdoor environments, seamlessly adapting to varying levels of scene complexity. Experiments on multiple benchmark datasets demonstrate that our method yields both compact, accurate geometry and high fidelity renderings, especially in challenging scenarios where robust geometric structure make a clear difference.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
AmorLIP: Efficient Language-Image Pretraining via Amortization
Authors:
Haotian Sun,
Yitong Li,
Yuchen Zhuang,
Niao He,
Hanjun Dai,
Bo Dai
Abstract:
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even t…
▽ More
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
Authors:
Xiangyu Wang,
Donglin Yang,
Yue Liao,
Wenhao Zheng,
wenjun wu,
Bin Dai,
Hongsheng Li,
Si Liu
Abstract:
Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions.…
▽ More
Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.
△ Less
Submitted 26 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation
Authors:
Hangyu Li,
Qin Zhao,
Haoran Xu,
Xinyu Jiang,
Qingwei Ben,
Feiyu Jia,
Haoyu Zhao,
Liang Xu,
Jia Zeng,
Hanqing Wang,
Bo Dai,
Junting Dong,
Jiangmiao Pang
Abstract:
Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines-ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces-there is still no unified benchmark that enables fa…
▽ More
Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines-ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces-there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities-(i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking-and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment
Authors:
Dingbang Huang,
Wenbo Li,
Yifei Zhao,
Xinyu Pan,
Yanhong Zeng,
Bo Dai
Abstract:
Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers such as rational global layout, physics-…
▽ More
Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers such as rational global layout, physics-plausible contacts and visual effects like shadows and reflections while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered-images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Authors:
Rushi Qiang,
Yuchen Zhuang,
Yinghao Li,
Dingu Sagar V K,
Rongzhi Zhang,
Changhao Li,
Ian Shu-Hei Wong,
Sherry Yang,
Percy Liang,
Chao Zhang,
Bo Dai
Abstract:
We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experimen…
▽ More
We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes
Authors:
Xijie Yang,
Linning Xu,
Lihan Jiang,
Dahua Lin,
Bo Dai
Abstract:
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, su…
▽ More
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5's Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
Authors:
Max Qiushi Lin,
Jincheng Mei,
Matin Aghaei,
Michael Lu,
Bo Dai,
Alekh Agarwal,
Dale Schuurmans,
Csaba Szepesvari,
Sharan Vaswani
Abstract:
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with…
▽ More
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as $\texttt{Lin-SPG}$) and demonstrate that the approximation error is irrelevant to the algorithm's global convergence even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee the asymptotic global convergence of $\texttt{Lin-SPG}$. Under these feature conditions, we prove that $T$ iterations of $\texttt{Lin-SPG}$ with a problem-specific learning rate result in an $O(1/T)$ convergence to the optimal policy. Furthermore, we prove that $\texttt{Lin-SPG}$ with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Representation Learning via Non-Contrastive Mutual Information
Authors:
Zhaohan Daniel Guo,
Bernardo Avila Pires,
Khimya Khetarpal,
Dale Schuurmans,
Bo Dai
Abstract:
Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks.…
▽ More
Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
A simulation-heuristics dual-process model for intuitive physics
Authors:
Shiqian Li,
Yuxi Ma,
Jiajun Yan,
Bo Dai,
Yujia Peng,
Chi Zhang,
Yixin Zhu
Abstract:
The role of mental simulation in human physical reasoning is widely acknowledged, but whether it is employed across scenarios with varying simulation costs and where its boundary lies remains unclear. Using a pouring-marble task, our human study revealed two distinct error patterns when predicting pouring angles, differentiated by simulation time. While mental simulation accurately captured human…
▽ More
The role of mental simulation in human physical reasoning is widely acknowledged, but whether it is employed across scenarios with varying simulation costs and where its boundary lies remains unclear. Using a pouring-marble task, our human study revealed two distinct error patterns when predicting pouring angles, differentiated by simulation time. While mental simulation accurately captured human judgments in simpler scenarios, a linear heuristic model better matched human predictions when simulation time exceeded a certain boundary. Motivated by these observations, we propose a dual-process framework, Simulation-Heuristics Model (SHM), where intuitive physics employs simulation for short-time simulation but switches to heuristics when simulation becomes costly. By integrating computational methods previously viewed as separate into a unified model, SHM quantitatively captures their switching mechanism. The SHM aligns more precisely with human behavior and demonstrates consistent predictive performance across diverse scenarios, advancing our understanding of the adaptive nature of intuitive physical reasoning.
△ Less
Submitted 19 May, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
-
Multi-identity Human Image Animation with Structural Video Diffusion
Authors:
Zhenzhi Wang,
Yixuan Li,
Yanhong Zeng,
Yuwei Guo,
Dahua Lin,
Tianfan Xue,
Bo Dai
Abstract:
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of hu…
▽ More
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.
△ Less
Submitted 5 April, 2025;
originally announced April 2025.
-
Yannakakis+: Practical Acyclic Query Evaluation with Theoretical Guarantees
Authors:
Qichen Wang,
Bingnan Chen,
Binyang Dai,
Ke Yi,
Feifei Li,
Liang Lin
Abstract:
Acyclic conjunctive queries form the backbone of most analytical workloads, and have been extensively studied in the literature from both theoretical and practical angles. However, there is still a large divide between theory and practice. While the 40-year-old Yannakakis algorithm has strong theoretical running time guarantees, it has not been adopted in real systems due to its high hidden consta…
▽ More
Acyclic conjunctive queries form the backbone of most analytical workloads, and have been extensively studied in the literature from both theoretical and practical angles. However, there is still a large divide between theory and practice. While the 40-year-old Yannakakis algorithm has strong theoretical running time guarantees, it has not been adopted in real systems due to its high hidden constant factor. In this paper, we strive to close this gap by proposing Yannakakis+, an improved version of the Yannakakis algorithm, which is more practically efficient while preserving its theoretical guarantees. Our experiments demonstrate that Yannakakis+ consistently outperforms the original Yannakakis algorithm by 2x to 5x across a wide range of queries and datasets.
Another nice feature of our new algorithm is that it generates a traditional DAG query plan consisting of standard relational operators, allowing Yannakakis+ to be easily plugged into any standard SQL engine. Our system prototype currently supports four different SQL engines (DuckDB, PostgreSQL, SparkSQL, and AnalyticDB from Alibaba Cloud), and our experiments show that Yannakakis+ is able to deliver better performance than their native query plans on 160 out of the 162 queries tested, with an average speedup of 2.41x and a maximum speedup of 47,059x.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Ordering-based Conditions for Global Convergence of Policy Gradient Methods
Authors:
Jincheng Mei,
Bo Dai,
Alekh Agarwal,
Mohammad Ghavamzadeh,
Csaba Szepesvari,
Dale Schuurmans
Abstract:
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. textcolor{blue}{First}, we establish a few key observations that frame the study: \textbf{(i)} Global convergence can be achieved under linear function approximation without policy or r…
▽ More
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. textcolor{blue}{First}, we establish a few key observations that frame the study: \textbf{(i)} Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). \textbf{(ii)} Approximation error is not a key quantity for characterizing global convergence in either algorithm. \textbf{(iii)} The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. \textcolor{blue}{Second}, motivated by these observations, we establish new general results: \textbf{(i)} NPG with linear function approximation achieves global convergence \emph{if and only if} the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. \textbf{(ii)} The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Authors:
Liang Pan,
Zeshi Yang,
Zhiyang Dou,
Wenjia Wang,
Buzhen Huang,
Bo Dai,
Taku Komura,
Jingbo Wang
Abstract:
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of m…
▽ More
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
△ Less
Submitted 3 April, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning
Authors:
Ke Ji,
Yixin Lian,
Linxu Li,
Jingsheng Gao,
Weiyuan Li,
Bin Dai
Abstract:
In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness limits the model's ability to provide personalized and diverse interactions further. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human ali…
▽ More
In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness limits the model's ability to provide personalized and diverse interactions further. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human alignment methods are difficult to deploy due to the inherent diversity of model behavior in role-playing scenarios. Inspired by the alignment of models for safety behaviors through RLHF (Reinforcement Learning from Human Feedback), in this paper, we revisit model role-playing behavior from the perspective of persona alignment and propose a novel annotation-free framework named \textbf{\underline{P}}ersona-Aware \textbf{\underline{C}}ontrastive \textbf{\underline{L}}earning (PCL) to align LLMs' behavior during role-playing, enhancing the model's role consistency. Specifically, we first design a role chain method to encourage the model to self-question based on the role characteristics and dialogue context to adjust personality consistency. Then, we further enhance the model's role-playing strategy through iterative contrastive learning between the use of role characteristics and not. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation methods (CharEval \& GPT-4) and human expert evaluation.
△ Less
Submitted 25 March, 2025; v1 submitted 22 March, 2025;
originally announced March 2025.
-
Control the Soft Robot Arm with its Physical Twin
Authors:
Qinghua Guan,
Hung Hon Cheng,
Benhui Dai,
Josie Hughes
Abstract:
To exploit the compliant capabilities of soft robot arms we require controller which can exploit their physical capabilities. Teleoperation, leveraging a human in the loop, is a key step towards achieving more complex control strategies. Whilst teleoperation is widely used for rigid robots, for soft robots we require teleoperation methods where the configuration of the whole body is considered. We…
▽ More
To exploit the compliant capabilities of soft robot arms we require controller which can exploit their physical capabilities. Teleoperation, leveraging a human in the loop, is a key step towards achieving more complex control strategies. Whilst teleoperation is widely used for rigid robots, for soft robots we require teleoperation methods where the configuration of the whole body is considered. We propose a method of using an identical 'physical twin', or demonstrator of the robot. This tendon robot can be back-driven, with the tendon lengths providing configuration perception, and enabling a direct mapping of tendon lengths for the execture. We demonstrate how this teleoperation across the entire configuration of the robot enables complex interactions with exploit the envrionment, such as squeezing into gaps. We also show how this method can generalize to robots which are a larger scale that the physical twin, and how, tuneability of the stiffness properties of the physical twin simplify its use.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
Authors:
Xinyu Lian,
Zichao Yu,
Ruiming Liang,
Yitong Wang,
Li Ray Luo,
Kaixu Chen,
Yuanzhen Zhou,
Qihong Tang,
Xudong Xu,
Zhaoyang Lyu,
Bo Dai,
Jiangmiao Pang
Abstract:
Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synth…
▽ More
Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that excel current state-of-the-art methods and are comparable to human-annotated datasets in both physics property and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at https://github.com/Intern-Nexus/Infinite-Mobility
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation
Authors:
Minyue Dai,
Jingbo Wang,
Ke Fan,
Bin Ji,
Haoyu Zhao,
Junting Dong,
Bo Dai
Abstract:
Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propos…
▽ More
Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.
△ Less
Submitted 12 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Authors:
Yu Pan,
Jiahao Chen,
Bingrong Dai,
Lin Wang,
Yi Du,
Jiao Liu
Abstract:
In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific visual patch or phrase. Existing defense strategies are well equipped to thwart such attac…
▽ More
In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific visual patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and low-dimensional triggers. For example, visual triggers are easily observed by defenders, text-based or attention-based triggers are more susceptible to neural network detection. To explore more possibilities of backdoor attack in DMs, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image-to-image tasks by introducing Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR). Our technique generates trigger-embedded images that are perceptually indistinguishable from clean images, thus bypassing both manual inspection and automated detection neural networks. Experiments demonstrate that Gungnir can easily bypass existing defense methods. Among existing DM defense frameworks, our approach achieves a 0 backdoor detection rate (BDR). Our codes are available at https://github.com/paoche11/Gungnir.
△ Less
Submitted 8 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Authors:
Tian Xie,
Zitian Gao,
Qingnan Ren,
Haoming Luo,
Yuqian Hong,
Bryan Dai,
Joey Zhou,
Kai Qiu,
Zhirong Wu,
Chong Luo
Abstract:
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that…
▽ More
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills-such as reflection, verification, and summarization-that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games
Authors:
Tong Yang,
Bo Dai,
Lin Xiao,
Yuejie Chi
Abstract:
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However,…
▽ More
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration via biasing the empirical estimate of the model parameters towards those with a higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly match their counterparts with sophisticated uncertainty quantification.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates
Authors:
Jincheng Mei,
Bo Dai,
Alekh Agarwal,
Sharan Vaswani,
Anant Raj,
Csaba Szepesvari,
Dale Schuurmans
Abstract:
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down…
▽ More
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
GAS: Generative Avatar Synthesis from a Single Image
Authors:
Yixing Lu,
Junting Dong,
Youngjoong Kwon,
Qin Zhao,
Bo Dai,
Fernando De la Torre
Abstract:
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image, addressing the challenging problem of single-image avatar generation. While recent methods employ diffusion models conditioned on human templates like depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between sparse…
▽ More
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image, addressing the challenging problem of single-image avatar generation. While recent methods employ diffusion models conditioned on human templates like depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between sparse driving signals and the actual human subject, resulting in multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. The dense driving signal from the initial reconstructed human provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Additionally, we propose a unified framework that enables the generalization learned from novel pose synthesis on in-the-wild videos to naturally transfer to novel view synthesis. Our video-based diffusion model enhances disentangled synthesis with high-quality view-consistent renderings for novel views and realistic non-rigid deformations in novel pose animation. Results demonstrate the superior generalization ability of our method across in-domain and out-of-domain in-the-wild datasets. Project page: https://humansensinglab.github.io/GAS/
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Efficient Online Reinforcement Learning for Diffusion Policy
Authors:
Haitong Ma,
Tianyi Chen,
Kai Wang,
Na Li,
Bo Dai
Abstract:
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating policy gradient through the diffusion process incurs hu…
▽ More
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating policy gradient through the diffusion process incurs huge computational costs and instability, thus being expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize the conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RLs on most tasks, and the DPMD improves more than 120% over soft actor-critic on Humanoid and Ant.
△ Less
Submitted 29 June, 2025; v1 submitted 1 February, 2025;
originally announced February 2025.
-
EdgeTAM: On-Device Track Anything Model
Authors:
Chong Zhou,
Chenchen Zhu,
Yunyang Xiong,
Saksham Suri,
Fanyi Xiao,
Lemeng Wu,
Raghuraman Krishnamoorthi,
Bo Dai,
Chen Change Loy,
Vikas Chandra,
Bilge Soran
Abstract:
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performan…
▽ More
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.
△ Less
Submitted 13 January, 2025;
originally announced January 2025.
-
Learning 3D Garment Animation from Trajectories of A Piece of Cloth
Authors:
Yidi Shao,
Chen Change Loy,
Bo Dai
Abstract:
Garment animation is ubiquitous in various applications, such as virtual reality, gaming, and film producing. Recently, learning-based approaches obtain compelling performance in animating diverse garments under versatile scenarios. Nevertheless, to mimic the deformations of the observed garments, data-driven methods require large scale of garment data, which are both resource-wise expensive and t…
▽ More
Garment animation is ubiquitous in various applications, such as virtual reality, gaming, and film producing. Recently, learning-based approaches obtain compelling performance in animating diverse garments under versatile scenarios. Nevertheless, to mimic the deformations of the observed garments, data-driven methods require large scale of garment data, which are both resource-wise expensive and time-consuming. In addition, forcing models to match the dynamics of observed garment animation may hinder the potentials to generalize to unseen cases. In this paper, instead of using garment-wise supervised-learning we adopt a disentangled scheme to learn how to animate observed garments: 1). learning constitutive behaviors from the observed cloth; 2). dynamically animate various garments constrained by the learned constitutive laws. Specifically, we propose Energy Unit network (EUNet) to model the constitutive relations in the format of energy. Without the priors from analytical physics models and differentiable simulation engines, EUNet is able to directly capture the constitutive behaviors from the observed piece of cloth and uniformly describes the change of energy caused by deformations, such as stretching and bending. We further apply the pre-trained EUNet to animate various garments based on energy optimizations. The disentangled scheme alleviates the need of garment data and enables us to utilize the dynamics of a piece of cloth for animating garments. Experiments show that while EUNet effectively delivers the energy gradients due to the deformations, models constrained by EUNet achieve more stable and physically plausible performance comparing with those trained in garment-wise supervised manner. Code is available at https://github.com/ftbabi/EUNet_NeurIPS2024.git .
△ Less
Submitted 2 January, 2025;
originally announced January 2025.
-
FasterSTS: A Faster Spatio-Temporal Synchronous Graph Convolutional Networks for Traffic flow Forecasting
Authors:
Ben-Ao Dai,
Nengchao Lyu,
Yongchao Miao
Abstract:
Accurate traffic flow prediction heavily relies on the spatio-temporal correlation of traffic flow data. Most current studies separately capture correlations in spatial and temporal dimensions, making it difficult to capture complex spatio-temporal heterogeneity, and often at the expense of increasing model complexity to improve prediction accuracy. Although there have been groundbreaking attempts…
▽ More
Accurate traffic flow prediction heavily relies on the spatio-temporal correlation of traffic flow data. Most current studies separately capture correlations in spatial and temporal dimensions, making it difficult to capture complex spatio-temporal heterogeneity, and often at the expense of increasing model complexity to improve prediction accuracy. Although there have been groundbreaking attempts in the field of spatio-temporal synchronous modeling, significant limitations remain in terms of performance and complexity control.This study proposes a quicker and more effective spatio-temporal synchronous traffic flow forecast model to address these issues.
△ Less
Submitted 1 January, 2025;
originally announced January 2025.
-
GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects
Authors:
Yidi Shao,
Mu Huang,
Chen Change Loy,
Bo Dai
Abstract:
We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that represents continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational effi…
▽ More
We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that represents continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that further organizes kernels into CMSs with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GausSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model will be released. Project page: https://www.mmlab-ntu.com/project/gausim/index.html .
△ Less
Submitted 10 March, 2025; v1 submitted 23 December, 2024;
originally announced December 2024.
-
LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding
Authors:
Hao Li,
Roy Qin,
Zhengyu Zou,
Diqi He,
Bohan Li,
Bingquan Dai,
Dingewn Zhang,
Junwei Han
Abstract:
Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier languages, ultimately failing to align objects in 3D space. By utilizing masked images for feature extraction, these approaches also lack essent…
▽ More
Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier languages, ultimately failing to align objects in 3D space. By utilizing masked images for feature extraction, these approaches also lack essential contextual information, leading to inaccurate feature representation. To this end, we propose a Language-Embedded Surface Field (LangSurf), which accurately aligns the 3D language fields with the surface of objects, facilitating precise 2D and 3D segmentation with text query, widely expanding the downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussian on the object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. In addition, we also introduce the Hierarchical-Context Awareness Module to extract features at the image level for contextual information then perform hierarchical mask pooling using masks segmented by SAM to obtain fine-grained language features in different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing, which is also supported by comprehensive experiments. \url{https://langsurf.github.io}.
△ Less
Submitted 23 December, 2024; v1 submitted 23 December, 2024;
originally announced December 2024.
-
Safe Dynamic Motion Generation in Configuration Space Using Differentiable Distance Fields
Authors:
Xuemin Chi,
Yiming Li,
Jihao Huang,
Bolun Dai,
Zhitao Liu,
Sylvain Calinon
Abstract:
Generating collision-free motions in dynamic environments is a challenging problem for high-dimensional robotics, particularly under real-time constraints. Control Barrier Functions (CBFs), widely utilized in safety-critical control, have shown significant potential for motion generation. However, for high-dimensional robot manipulators, existing QP formulations and CBF-based methods rely on posit…
▽ More
Generating collision-free motions in dynamic environments is a challenging problem for high-dimensional robotics, particularly under real-time constraints. Control Barrier Functions (CBFs), widely utilized in safety-critical control, have shown significant potential for motion generation. However, for high-dimensional robot manipulators, existing QP formulations and CBF-based methods rely on positional information, overlooking higher-order derivatives such as velocities. This limitation may lead to reduced success rates, decreased performance, and inadequate safety constraints. To address this, we construct time-varying CBFs (TVCBFs) that consider velocity conditions for obstacles. Our approach leverages recent developments on distance fields for articulated manipulators, a differentiable representation that enables the mapping of objects' position and velocity into the robot's joint space, offering a comprehensive understanding of the system's interactions. This allows the manipulator to be treated as a point-mass system thus simplifying motion generation tasks. Additionally, we introduce a time-varying control Lyapunov function (TVCLF) to enable whole-body contact motions. Our approach integrates the TVCBF, TVCLF, and manipulator physical constraints within a unified QP framework. We validate our method through simulations and comparisons with state-of-the-art approaches, demonstrating its effectiveness on a 7-axis Franka robot in real-world experiments.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Authors:
Yinlam Chow,
Guy Tennenholtz,
Izzeddin Gur,
Vincent Zhuang,
Bo Dai,
Sridhar Thiagarajan,
Craig Boutilier,
Rishabh Agarwal,
Aviral Kumar,
Aleksandra Faust
Abstract:
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective…
▽ More
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning~(RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
Authors:
Shunlin Lu,
Jingbo Wang,
Zeyu Lu,
Ling-Hao Chen,
Wenxun Dai,
Junting Dong,
Zhiyang Dou,
Bo Dai,
Ruimao Zhang
Abstract:
The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experimen…
▽ More
The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm the power law between Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens with respect to compute budgets respectively. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss, thereby validating the scaling law.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes
Authors:
Yuxi Wei,
Jingbo Wang,
Yuwen Du,
Dingju Wang,
Liang Pan,
Chenxin Xu,
Yao Feng,
Bo Dai,
Siheng Chen
Abstract:
Generating realistic and interactive dynamics of traffic participants according to specific instruction is critical for street scene simulation. However, there is currently a lack of a comprehensive method that generates realistic dynamics of different types of participants including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, t…
▽ More
Generating realistic and interactive dynamics of traffic participants according to specific instruction is critical for street scene simulation. However, there is currently a lack of a comprehensive method that generates realistic dynamics of different types of participants including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi-LLM-agent role-playing approach, which utilizes natural language inputs to plan the trajectories and behaviors for different traffic participants. To generate realistic fine-grained dynamics based on the planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi-task executor that generates realistic pedestrian dynamics under different task plannings; and the VehExecutor, a physical transition-based policy that generates physically plausible vehicle dynamics. Extensive experiments show that ChatDyn can generate realistic driving scene dynamics with multiple vehicles and pedestrians, and significantly outperforms previous methods on subtasks. Code and model will be available at https://vfishc.github.io/chatdyn.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
SLGaussian: Fast Language Gaussian Splatting in Sparse Views
Authors:
Kangjie Chen,
BingQuan Dai,
Minghan Qin,
Dongbin Zhang,
Peihao Li,
Yingshuang Zou,
Haoqian Wang
Abstract:
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forwa…
▽ More
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
Proc-GS: Procedural Building Generation for City Assembly with 3D Gaussians
Authors:
Yixuan Li,
Xingjian Ran,
Linning Xu,
Tao Lu,
Mulin Yu,
Zhenzhi Wang,
Yuanbo Xiangli,
Dahua Lin,
Bo Dai
Abstract:
Buildings are primary components of cities, often featuring repeated elements such as windows and doors. Traditional 3D building asset creation is labor-intensive and requires specialized skills to develop design rules. Recent generative models for building creation often overlook these patterns, leading to low visual fidelity and limited scalability. Drawing inspiration from procedural modeling t…
▽ More
Buildings are primary components of cities, often featuring repeated elements such as windows and doors. Traditional 3D building asset creation is labor-intensive and requires specialized skills to develop design rules. Recent generative models for building creation often overlook these patterns, leading to low visual fidelity and limited scalability. Drawing inspiration from procedural modeling techniques used in the gaming and visual effects industry, our method, Proc-GS, integrates procedural code into the 3D Gaussian Splatting (3D-GS) framework, leveraging their advantages in high-fidelity rendering and efficient asset management from both worlds. By manipulating procedural code, we can streamline this process and generate an infinite variety of buildings. This integration significantly reduces model size by utilizing shared foundational assets, enabling scalable generation with precise control over building assembly. We showcase the potential for expansive cityscape generation while maintaining high rendering fidelity and precise control on both real and synthetic cases.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Performance Analysis and Code Design for Resistive Random-Access Memory Using Channel Decomposition Approach
Authors:
Guanghui Song,
Meiru Gao,
Ying Li,
Bin Dai,
Kui Cai
Abstract:
A novel framework for performance analysis and code design is proposed to address the sneak path (SP) problem in resistive random-access memory (ReRAM) arrays. The main idea is to decompose the ReRAM channel, which is both non-ergodic and data-dependent, into multiple stationary memoryless channels. A finite-length performance bound is derived by analyzing the capacity and dispersion of these stat…
▽ More
A novel framework for performance analysis and code design is proposed to address the sneak path (SP) problem in resistive random-access memory (ReRAM) arrays. The main idea is to decompose the ReRAM channel, which is both non-ergodic and data-dependent, into multiple stationary memoryless channels. A finite-length performance bound is derived by analyzing the capacity and dispersion of these stationary memoryless channels. Furthermore, leveraging this channel decomposition, a practical sparse-graph code design is proposed using density evolution. The obtained channel codes are not only asymptotic capacity approaching but also close to the derived finite-length performance bound.
△ Less
Submitted 9 December, 2024;
originally announced December 2024.