-
SnapNCode: An Integrated Development Environment for Programming Physical Objects Interactions
Authors:
Xiaoyan Wei,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In…
▽ More
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In response to this challenge, we introduce SnapNCode, a novel IDE for spatial programming. SnapNCode enables programmers to capture various states of physical objects through live video streams from cameras and directly insert these visual representations into their code. Moreover, users can augment physical objects by attaching code snippets onto objects, which are opportunistically triggered when observed by cameras. We conducted a user study (N=12) to assess the usability of SnapNCode. Feedback from participants indicates that the system is easy-to-use and holds promise for daily casual uses and integration into a broader range of workflows.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Context-AI Tunes: Context-Aware AI-Generated Music for Stress Reduction
Authors:
Xiaoyan Wei,
Zebang Zhang,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that genera…
▽ More
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that generates personalized music based on environmental inputs and the user's self-assessed stress level. A 2x2 within-subject experiment (N=26) was conducted with two independent variables: AI (AI, NoAI) and Environment (Busy Hub, Quiet Library). CAT's effectiveness in reducing stress was evaluated using the Visual Analog Scale for Stress (VAS-S). Results show that CAT is more effective than manually chosen music in reducing stress by adapting to user context.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Self-Consuming Generative Models with Adversarially Curated Data
Authors:
Xiukun Wei,
Xueru Zhang
Abstract:
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their pre…
▽ More
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival's model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units
Authors:
Huakun Liu,
Hiroki Ota,
Xin Wei,
Yutaro Hirao,
Monica Perusquia-Hernandez,
Hideaki Uchiyama,
Kiyoshi Kiyokawa
Abstract:
Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-…
▽ More
Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real-time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and the improvement over state of the art in pose accuracy.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Authors:
Chenggang Zhao,
Chengqi Deng,
Chong Ruan,
Damai Dai,
Huazuo Gao,
Jiashi Li,
Liyue Zhang,
Panpan Huang,
Shangyan Zhou,
Shirong Ma,
Wenfeng Liang,
Ying He,
Yuqing Wang,
Yuxuan Liu,
Y. X. Wei
Abstract:
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…
▽ More
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Authors:
Dayong Liang,
Changmeng Zheng,
Zhiyuan Wen,
Yi Cai,
Xiao-Yong Wei,
Qing Li
Abstract:
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizin…
▽ More
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a lone-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Robust Movable-Antenna Position Optimization with Imperfect CSI for MISO Systems
Authors:
Haifeng Ma,
Weidong Mei,
Xin Wei,
Boyu Ning,
Zhi Chen
Abstract:
Movable antenna (MA) technology has emerged as a promising solution for reconfiguring wireless channel conditions through local antenna movement within confined regions. Unlike previous works assuming perfect channel state information (CSI), this letter addresses the robust MA position optimization problem under imperfect CSI conditions for a multiple-input single-output (MISO) MA system. Specific…
▽ More
Movable antenna (MA) technology has emerged as a promising solution for reconfiguring wireless channel conditions through local antenna movement within confined regions. Unlike previous works assuming perfect channel state information (CSI), this letter addresses the robust MA position optimization problem under imperfect CSI conditions for a multiple-input single-output (MISO) MA system. Specifically, we consider two types of CSI errors: norm-bounded and randomly distributed errors, aiming to maximize the worst-case and non-outage received signal power, respectively. For norm-bounded CSI errors, we derive the worst-case received signal power in closed-form. For randomly distributed CSI errors, due to the intractability of the probabilistic constraints, we apply the Bernstein-type inequality to obtain a closed-form lower bound for the non-outage received signal power. Based on these results, we show the optimality of the maximum-ratio transmission for imperfect CSI in both scenarios and employ a graph-based algorithm to obtain the optimal MA positions. Numerical results show that our proposed scheme can even outperform other benchmark schemes implemented under perfect CSI conditions.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws
Authors:
Xiyuan Wei,
Ming Lin,
Fanjiang Ye,
Fengguang Song,
Liangliang Cao,
My T. Thai,
Tianbao Yang
Abstract:
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading…
▽ More
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called $\textbf{DRRho risk minimization}$, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches.
△ Less
Submitted 13 May, 2025; v1 submitted 10 May, 2025;
originally announced May 2025.
-
UniCO: Towards a Unified Model for Combinatorial Optimization Problems
Authors:
Zefang Zong,
Xiaochen Wei,
Guozhen Zhang,
Chen Gao,
Huandong Wang,
Yong Li
Abstract:
Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and conven…
▽ More
Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and convenience. In this paper, we introduce UniCO, a unified model for solving various CO problems. Inspired by the success of next-token prediction, we frame each problem-solving process as a Markov Decision Process (MDP), tokenize the corresponding sequential trajectory data, and train the model using a transformer backbone. To reduce token length in the trajectory data, we propose a CO-prefix design that aggregates static problem features. To address the heterogeneity of state and action tokens within the MDP, we employ a two-stage self-supervised learning approach. In this approach, a dynamic prediction model is first trained and then serves as a pre-trained model for subsequent policy generation. Experiments across 10 CO problems showcase the versatility of UniCO, emphasizing its ability to generalize to new, unseen problems with minimal fine-tuning, achieving even few-shot or zero-shot performance. Our framework offers a valuable complement to existing neural CO methods that focus on optimizing performance for individual problems.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Mechanical Power Modeling and Energy Efficiency Maximization for Movable Antenna Systems
Authors:
Xin Wei,
Weidong Mei,
Xuan Huang,
Zhi Chen,
Boyu Ning
Abstract:
Movable antennas (MAs) have recently garnered significant attention in wireless communications due to their capability to reshape wireless channels via local antenna movement within a confined region. However, to achieve accurate antenna movement, MA drivers introduce non-negligible mechanical power consumption, rendering energy efficiency (EE) optimization more critical compared to conventional f…
▽ More
Movable antennas (MAs) have recently garnered significant attention in wireless communications due to their capability to reshape wireless channels via local antenna movement within a confined region. However, to achieve accurate antenna movement, MA drivers introduce non-negligible mechanical power consumption, rendering energy efficiency (EE) optimization more critical compared to conventional fixed-position antenna (FPA) systems. To address this problem, we develop in this paper a fundamental power consumption model for stepper motor-driven MA systems by resorting to basic electric motor theory. Based on this model, we formulate an EE maximization problem by jointly optimizing an MA's position, moving speed, and transmit power. However, this problem is difficult to solve optimally due to the intricate relationship between the mechanical power consumption and the design variables. To tackle this issue, we first uncover a hidden monotonicity of the EE performance with respect to the MA's moving speed. Then, we apply the Dinkelbach algorithm to obtain the optimal transmit power in a semi-closed form for any given MA position, followed by an enumeration to determine the optimal MA position. Numerical results demonstrate that despite the additional mechanical power consumption, the MA system can outperform the conventional FPA system in terms of EE.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
STAR-Rec: Making Peace with Length Variance and Pattern Diversity in Sequential Recommendation
Authors:
Maolin Wang,
Sheng Zhang,
Ruocheng Guo,
Wanyu Wang,
Xuetao Wei,
Zitao Liu,
Hongzhi Yin,
Yi Chang,
Xiangyu Zhao
Abstract:
Recent deep sequential recommendation models often struggle to effectively model key characteristics of user behaviors, particularly in handling sequence length variations and capturing diverse interaction patterns. We propose STAR-Rec, a novel architecture that synergistically combines preference-aware attention and state-space modeling through a sequence-level mixture-of-experts framework. STAR-…
▽ More
Recent deep sequential recommendation models often struggle to effectively model key characteristics of user behaviors, particularly in handling sequence length variations and capturing diverse interaction patterns. We propose STAR-Rec, a novel architecture that synergistically combines preference-aware attention and state-space modeling through a sequence-level mixture-of-experts framework. STAR-Rec addresses these challenges by: (1) employing preference-aware attention to capture both inherently similar item relationships and diverse preferences, (2) utilizing state-space modeling to efficiently process variable-length sequences with linear complexity, and (3) incorporating a mixture-of-experts component that adaptively routes different behavioral patterns to specialized experts, handling both focused category-specific browsing and diverse category exploration patterns. We theoretically demonstrate how the state space model and attention mechanisms can be naturally unified in recommendation scenarios, where SSM captures temporal dynamics through state compression while attention models both similar and diverse item relationships. Extensive experiments on four real-world datasets demonstrate that STAR-Rec consistently outperforms state-of-the-art sequential recommendation methods, particularly in scenarios involving diverse user behaviors and varying sequence lengths.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
Authors:
Detao Bai,
Zhiheng Ma,
Xihan Wei,
Liefeng Bo
Abstract:
The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a…
▽ More
The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony, contrastive feature alignment and generative text prediction, using only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to achieving a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR) with a WER of 20.5 on LRS2, and significantly improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks like Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
△ Less
Submitted 15 May, 2025; v1 submitted 6 May, 2025;
originally announced May 2025.
-
NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization
Authors:
Xun Li,
Jian Yang,
Fenli Jia,
Muyu Wang,
Qi Wu,
Jun Wu,
Jinpeng Mi,
Jilin Hu,
Peidong Liang,
Xuan Tang,
Ke Li,
Xiong You,
Xian Wei
Abstract:
Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cell…
▽ More
Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cells, place cells, and head direction cells), we propose a novel neurobiological camera location method, namely NeuroLoc. Firstly, we designed a Hebbian learning module driven by place cells to save and replay historical information, aiming to restore the details of historical representations and solve the issue of scene fuzziness. Secondly, we utilized the head direction cell-inspired internal direction learning as multi-head attention embedding to help restore the true orientation in similar scenes. Finally, we added a 3D grid center prediction in the pose regression module to reduce the final wrong prediction. We evaluate the proposed NeuroLoc on commonly used benchmark indoor and outdoor datasets. The experimental results show that our NeuroLoc can enhance the robustness in complex environments and improve the performance of pose regression by using only a single image.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Latent Feature-Guided Conditional Diffusion for High-Fidelity Generative Image Semantic Communication
Authors:
Zehao Chen,
Xinfeng Wei,
Haonan Tong,
Zhaohui Yang,
Changchuan Yin
Abstract:
Semantic communication is proposed and expected to improve the efficiency and effectiveness of massive data transmission over sixth generation (6G) networks. However, existing deep learning-based joint source and channel coding (DeepJSCC) image semantic communication scheme predominantly focuses on optimizing pixel-level metrics, and neglects human perceptual requirements, which results in degrade…
▽ More
Semantic communication is proposed and expected to improve the efficiency and effectiveness of massive data transmission over sixth generation (6G) networks. However, existing deep learning-based joint source and channel coding (DeepJSCC) image semantic communication scheme predominantly focuses on optimizing pixel-level metrics, and neglects human perceptual requirements, which results in degraded perceptual quality. To address this issue, we propose a latent representation-oriented image semantic communication (LRISC) system, which transmits latent semantic features for image generation with semantic consistency, thereby ensuring the perceptual quality at the receiver. In particular, we first map the source image to latent features in a high-dimensional semantic space via a neural network (NN)- based non-linear transformation. Subsequently, these features are encoded using a joint source and channel coding (JSCC) scheme with adaptive coding length for efficient transmission over a wireless channel. At the receiver, a conditional diffusion model is developed by using the received latent features as conditional guidance to steer the reverse diffusion process, progressively reconstructing high-fidelity images while preserving semantic consistency. Moreover, we introduce a channel signal-to-noise ratio (SNR) adaptation mechanism, allowing one model to work across various channel states. Experiments show that the proposed method significantly outperforms existing methods, in terms of learned perceptual image patch similarity (LPIPS) and robustness against channel noise, with an average LPIPS reduction of 43.3% compared to DeepJSCC, while guaranteeing the semantic consistency.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?
Authors:
Zheng Hui,
Xiaokai Wei,
Yexi Jiang,
Kevin Gao,
Chen Wang,
Frank Ong,
Se-eun Yoon,
Rachit Pareek,
Michelle Gong
Abstract:
In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation system, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent a…
▽ More
In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation system, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendations accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Authors:
Yi-Xing Peng,
Qize Yang,
Yu-Ming Tang,
Shenghao Fu,
Kun-Yu Lin,
Xihan Wei,
Wei-Shi Zheng
Abstract:
Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each…
▽ More
Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each accompanied by detailed annotations that meticulously label every limb movement. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions. Experimental results indicate that, while current large multimodal models perform commendably on various tasks, they often fall short in achieving fine-grained understanding. We attribute this limitation to the scarcity of meticulously annotated data, which is both costly and difficult to scale manually. Since manual annotations are costly and hard to scale, we propose proxy tasks to enhance the model perception ability in both spatial and temporal dimensions. These proxy tasks are carefully crafted to be driven by data automatically generated from existing MLLMs, thereby reducing the reliance on costly manual labels. Experimental results show that the proposed proxy tasks significantly narrow the gap toward the performance achieved with manually annotated fine-grained data.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
A Generative Graph Contrastive Learning Model with Global Signal
Authors:
Xiaofan Wei,
Binyan Zhang
Abstract:
Graph contrastive learning (GCL) has garnered significant attention recently since it learns complex structural information from graphs through self-supervised learning manner. However, prevalent GCL models may suffer from performance degradation due to inappropriate contrastive signals. Concretely, they commonly generate augmented views based on random perturbation, which leads to biased essentia…
▽ More
Graph contrastive learning (GCL) has garnered significant attention recently since it learns complex structural information from graphs through self-supervised learning manner. However, prevalent GCL models may suffer from performance degradation due to inappropriate contrastive signals. Concretely, they commonly generate augmented views based on random perturbation, which leads to biased essential structures due to the introduction of noise. In addition, they assign equal weight to both hard and easy sample pairs, thereby ignoring the difference in importance of the sample pairs. To address these issues, this study proposes a novel Contrastive Signal Generative Framework for Accurate Graph Learning (CSG2L) with the following two-fold ideas: a) building a singular value decomposition (SVD)-directed augmented module (SVD-aug) to obtain the global interactions as well as avoiding the random noise perturbation; b) designing a local-global dependency learning module (LGDL) with an adaptive reweighting strategy which can differentiate the effects of hard and easy sample pairs. Extensive experiments on benchmark datasets demonstrate that the proposed CSG2L outperforms the state-of-art baselines. Moreover, CSG2L is compatible with a variety of GNNs.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Authors:
Ling You,
Wenxuan Huang,
Xinni Xie,
Xiangyi Wei,
Bangyan Li,
Shaohui Lin,
Yang Li,
Changbo Wang
Abstract:
Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, exis…
▽ More
Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on the temporal a priori for caption generation, so they cannot process the soccer video end-to-end. While some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context to achieve suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
△ Less
Submitted 28 April, 2025; v1 submitted 24 April, 2025;
originally announced April 2025.
-
ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance
Authors:
Ying Li,
Xiaobao Wei,
Xiaowei Chi,
Yuming Li,
Zhongyu Zhao,
Hao Wang,
Ningning Ma,
Ming Lu,
Shanghang Zhang
Abstract:
While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional in…
▽ More
While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on the action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as the action tree and assign embeddings to tree nodes, each instruction can acquire its embeddings by navigating through the action tree. The instruction embeddings can be used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts the instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics in both seen and unseen tasks, with PSNR improved from 19.55 to 21.05, SSIM improved from 0.7474 to 0.7982 and reduced Flow Error from 3.506 to 3.201 in unseen tasks, compared to the recent RoboDreamer model. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% in 6 RLbench tasks on average.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning
Authors:
Pingcheng Jian,
Xiao Wei,
Yanbaihui Liu,
Samuel A. Moore,
Michael M. Zavlanos,
Boyuan Chen
Abstract:
We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) t…
▽ More
We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Authors:
Hong-Tao Yu,
Xiu-Shen Wei,
Yuxin Peng,
Serge Belongie
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks-fundamental to computer vision-remain largely unexplored. To fill this gap, we introduce a comprehensive fine…
▽ More
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks-fundamental to computer vision-remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
△ Less
Submitted 13 May, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization
Authors:
Shouwei Ruan,
Zhenyu Wu,
Yao Huang,
Ruochen Zhang,
Yitong Sun,
Caixin Kang,
Xingxing Wei
Abstract:
Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I m…
▽ More
Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into the general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs. In SC-DPO, we introduce a safety cost model to accurately quantify harmful levels for images, and train it effectively using the proposed contrastive learning and cost anchoring objectives. To apply SC-DPO for effective T2I safety alignment, we constructed SCP-10K, a safety-constrained preference dataset containing rich harmful concepts, which blends safety-constrained preference pairs under both harmful and clean instructions, further mitigating the trade-off between safety and sample quality. Additionally, we propose a Dynamic Focusing Mechanism (DFM) for SC-DPO, promoting the model's learning of difficult preference pair samples. Extensive experiments demonstrate that SC-DPO outperforms existing methods, effectively defending against various NSFW content while maintaining optimal sample quality and human preference alignment. Additionally, SC-DPO exhibits resilience against adversarial prompts designed to generate harmful content.
△ Less
Submitted 19 April, 2025;
originally announced April 2025.
-
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Authors:
Xinlin Zhuang,
Jiahui Peng,
Ren Ma,
Yinfan Wang,
Tianyi Bai,
Xingjian Wei,
Jiantao Qiu,
Chi Zhang,
Ying Qian,
Conghui He
Abstract:
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-f…
▽ More
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.
△ Less
Submitted 30 April, 2025; v1 submitted 19 April, 2025;
originally announced April 2025.
-
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Authors:
ByteDance Seed,
:,
Jiaze Chen,
Tiantian Fan,
Xin Liu,
Lingjun Liu,
Zhiqi Lin,
Mingxuan Wang,
Chengyi Wang,
Xiangpeng Wei,
Wenyuan Xu,
Yufeng Yuan,
Yu Yue,
Lin Yan,
Qiying Yu,
Xiaochen Zuo,
Chi Zhang,
Ruofei Zhu,
Zhecheng An,
Zhihao Bai,
Yu Bao,
Xingyan Bin,
Jiangjie Chen,
Feng Chen,
Hongmin Chen
, et al. (249 additional authors not shown)
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…
▽ More
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
△ Less
Submitted 29 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Authors:
Yan Yang,
Yixia Li,
Hongru Wang,
Xuetao Wei,
Jianqiao Yu,
Yun Chen,
Guanhua Chen
Abstract:
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these met…
▽ More
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Image Editing with Diffusion Models: A Survey
Authors:
Jia Wang,
Jie Hu,
Xiaoqi Ma,
Hanghang Ma,
Xiaoming Wei,
Enhua Wu
Abstract:
With deeper exploration of diffusion model, developments in the field of image generation have triggered a boom in image creation. As the quality of base-model generated images continues to improve, so does the demand for further application like image editing. In recent years, many remarkable works are realizing a wide variety of editing effects. However, the wide variety of editing types and div…
▽ More
With deeper exploration of diffusion model, developments in the field of image generation have triggered a boom in image creation. As the quality of base-model generated images continues to improve, so does the demand for further application like image editing. In recent years, many remarkable works are realizing a wide variety of editing effects. However, the wide variety of editing types and diverse editing approaches have made it difficult for researchers to establish a comprehensive view of the development of this field. In this survey, we summarize the image editing field from four aspects: tasks definition, methods classification, results evaluation and editing datasets. First, we provide a definition of image editing, which in turn leads to a variety of editing task forms from the perspective of operation parts and manipulation actions. Subsequently, we categorize and summary methods for implementing editing into three categories: inversion-based, fine-tuning-based and adapter-based. In addition, we organize the currently used metrics, available datasets and corresponding construction methods. At the end, we present some visions for the future development of the image editing field based on the previous summaries.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
Authors:
Hengyu Shi,
Junhao Su,
Huansheng Ning,
Xiaoming Wei,
Jialin Gao
Abstract:
Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free a…
▽ More
Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Authors:
Jinguo Zhu,
Weiyun Wang,
Zhe Chen,
Zhaoyang Liu,
Shenglong Ye,
Lixin Gu,
Hao Tian,
Yuchen Duan,
Weijie Su,
Jie Shao,
Zhangwei Gao,
Erfei Cui,
Xuehui Wang,
Yue Cao,
Yangzhou Liu,
Xingguang Wei,
Hongjie Zhang,
Haomin Wang,
Weiye Xu,
Hao Li,
Jiahao Wang,
Nianchen Deng,
Songze Li,
Yinan He,
Tan Jiang
, et al. (26 additional authors not shown)
Abstract:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…
▽ More
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
△ Less
Submitted 18 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes
Authors:
Huijie Liu,
Bingcan Wang,
Jie Hu,
Xiaoming Wei,
Guoliang Kang
Abstract:
Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particula…
▽ More
Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
△ Less
Submitted 30 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
Authors:
Hao Wang,
Xiaobao Wei,
Xiaoan Zhang,
Jianing Li,
Chengyu Bai,
Ying Li,
Ming Lu,
Wenzhao Zheng,
Shanghang Zhang
Abstract:
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the ori…
▽ More
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
DWFS-Obfuscation: Dynamic Weighted Feature Selection for Robust Malware Familial Classification under Obfuscation
Authors:
Xingyuan Wei,
Zijun Cheng,
Ning Li,
Qiujian Lv,
Ziyang Yu,
Degang Sun
Abstract:
Due to its open-source nature, the Android operating system has consistently been a primary target for attackers. Learning-based methods have made significant progress in the field of Android malware detection. However, traditional detection methods based on static features struggle to identify obfuscated malicious code, while methods relying on dynamic analysis suffer from low efficiency. To addr…
▽ More
Due to its open-source nature, the Android operating system has consistently been a primary target for attackers. Learning-based methods have made significant progress in the field of Android malware detection. However, traditional detection methods based on static features struggle to identify obfuscated malicious code, while methods relying on dynamic analysis suffer from low efficiency. To address this, we propose a dynamic weighted feature selection method that analyzes the importance and stability of features, calculates scores to filter out the most robust features, and combines these selected features with the program's structural information. We then utilize graph neural networks for classification, thereby improving the robustness and accuracy of the detection system. We analyzed 8,664 malware samples from eight malware families and tested a total of 44,940 malware variants generated using seven obfuscation strategies. Experiments demonstrate that our proposed method achieves an F1-score of 95.56% on the unobfuscated dataset and 92.28% on the obfuscated dataset, indicating that the model can effectively detect obfuscated malware.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
LCGC: Learning from Consistency Gradient Conflicting for Class-Imbalanced Semi-Supervised Debiasing
Authors:
Weiwei Xing,
Yue Cheng,
Hongzhu Yi,
Xiaohui Gao,
Xiang Wei,
Xiaoyu Guo,
Yuming Zhang,
Xinyu Pang
Abstract:
Classifiers often learn to be biased corresponding to the class-imbalanced dataset, especially under the semi-supervised learning (SSL) set. While previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image's logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the bla…
▽ More
Classifiers often learn to be biased corresponding to the class-imbalanced dataset, especially under the semi-supervised learning (SSL) set. While previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image's logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the black image is the best choice. We also indicated that as the training process deepens, the pseudo-labels before and after refinement become closer. Based on this observation, we propose a debiasing scheme dubbed LCGC, which Learning from Consistency Gradient Conflicting, by encouraging biased class predictions during training. We intentionally update the pseudo-labels whose gradient conflicts with the debiased logits, representing the optimization direction offered by the over-imbalanced classifier predictions. Then, we debiased the predictions by subtracting the baseline image logits during testing. Extensive experiments demonstrate that LCGC can significantly improve the prediction accuracy of existing CISSL models on public benchmarks.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model
Authors:
Xiaochen Wei,
Weiwei Guo,
Wenxian Yu,
Feiming Wei,
Dongying Li
Abstract:
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issues, we propose OSDM-MReg, a novel multimodal image registration framework based image-to-image translation to eliminate t…
▽ More
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issues, we propose OSDM-MReg, a novel multimodal image registration framework based image-to-image translation to eliminate the gap of multimodal images. Firstly, we propose a novel one-step unaligned target-guided conditional denoising diffusion probabilistic models(UTGOS-CDDPM)to translate multimodal images into a unified domain. In the inference stage, traditional conditional DDPM generate translated source image by a large number of iterations, which severely slows down the image registration task. To address this issues, we use the unaligned traget image as a condition to promote the generation of low-frequency features of the translated source image. Furthermore, during the training stage, we add the inverse process of directly predicting the translated image to ensure that the translated source image can be generated in one step during the testing stage. Additionally, to supervised the detail features of translated source image, we propose a new perceptual loss that focuses on the high-frequency feature differences between the translated and ground-truth images. Finally, a multimodal multiscale image registration network (MM-Reg) fuse the multimodal feature of the unimodal images and multimodal images by proposed multimodal feature fusion strategy. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks, particularly for SAR-optical image pairs.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Authors:
Yu Yue,
Yufeng Yuan,
Qiying Yu,
Xiaochen Zuo,
Ruofei Zhu,
Wenyuan Xu,
Jiaze Chen,
Chengyi Wang,
TianTian Fan,
Zhengyin Du,
Xiangpeng Wei,
Xiangyu Yu,
Gaohong Liu,
Juncai Liu,
Lingjun Liu,
Haibin Lin,
Zhiqi Lin,
Bole Ma,
Chi Zhang,
Mofan Zhang,
Wang Zhang,
Hang Zhu,
Ru Zhang,
Xin Liu,
Mingxuan Wang
, et al. (2 additional authors not shown)
Abstract:
We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models., a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the pr…
▽ More
We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models., a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
△ Less
Submitted 10 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
Authors:
Carolina Zheng,
Minhui Huang,
Dmitrii Pedchenko,
Kaushik Rangadurai,
Siyu Wang,
Gaby Nahum,
Jie Lei,
Yang Yang,
Tao Liu,
Zutian Luo,
Xiaohan Wei,
Dinesh Ramasamy,
Jiyan Yang,
Yiping Han,
Lin Yang,
Hangjun Xu,
Rong Jin,
Shuang Yang
Abstract:
The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural id life cycles (e.g, the birth of new IDs and retirement of old IDs). To address these issues, many sys…
▽ More
The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural id life cycles (e.g, the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the id space and control the corresponding model parameters (i.e embedding table). However, this approach introduces data pollution from multiple ids sharing the same embedding, leading to degraded model performance and embedding representation instability.
This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail id modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
MolGround: A Benchmark for Molecular Grounding
Authors:
Jiaxin Wu,
Ting Zhang,
Rubing Chen,
Wengyu Zhang,
Chen Jason Zhang,
Xiao-Yong Wei,
Li Qing
Abstract:
Current molecular understanding approaches predominantly focus on the descriptive aspect of human perception, providing broad, topic-level insights. However, the referential aspect -- linking molecular concepts to specific structural components -- remains largely unexplored. To address this gap, we propose a molecular grounding benchmark designed to evaluate a model's referential abilities. We ali…
▽ More
Current molecular understanding approaches predominantly focus on the descriptive aspect of human perception, providing broad, topic-level insights. However, the referential aspect -- linking molecular concepts to specific structural components -- remains largely unexplored. To address this gap, we propose a molecular grounding benchmark designed to evaluate a model's referential abilities. We align molecular grounding with established conventions in NLP, cheminformatics, and molecular science, showcasing the potential of NLP techniques to advance molecular understanding within the AI for Science movement. Furthermore, we constructed the largest molecular understanding benchmark to date, comprising 117k QA pairs, and developed a multi-agent grounding prototype as proof of concept. This system outperforms existing models, including GPT-4o, and its grounding outputs have been integrated to enhance traditional tasks such as molecular captioning and ATC (Anatomical, Therapeutic, Chemical) classification.
△ Less
Submitted 30 April, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models
Authors:
Liangbo Ning,
Ziran Liang,
Zhuohang Jiang,
Haohao Qu,
Yujuan Ding,
Wenqi Fan,
Xiao-yong Wei,
Shanru Lin,
Hui Liu,
Philip S. Yu,
Qing Li
Abstract:
With the advancement of web techniques, they have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intel…
▽ More
With the advancement of web techniques, they have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents -- termed WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: `Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?' To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.
△ Less
Submitted 10 May, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting
Authors:
Ruifeng Luo,
Zhengjie Liu,
Tianxiao Cheng,
Jie Wang,
Tongjie Wang,
Xingguang Wei,
Haomin Wang,
YanPeng Li,
Fu Chai,
Fei Cheng,
Shenglong Ye,
Wenhai Wang,
Yanting Zhang,
Yu Qiao,
Hongjie Zhang,
Xianzhong Zhao
Abstract:
Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400…
▽ More
Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.
△ Less
Submitted 2 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Authors:
Haote Yang,
Xingjian Wei,
Jiang Wu,
Noémi Ligeti-Nagy,
Jiaxing Sun,
Yinfan Wang,
Zijian Győző Yang,
Junyuan Gao,
Jingchao Wang,
Bowen Jiang,
Shasha Wang,
Nanjun Yu,
Zihao Zhang,
Shixin Hong,
Hongwei Liu,
Wei Li,
Songyang Zhang,
Dahua Lin,
Lijun Wu,
Gábor Prószéky,
Conghui He
Abstract:
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative…
▽ More
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks
Authors:
Yangqi Feng,
Shing-Ho J. Lin,
Baoyuan Gao,
Xian Wei
Abstract:
Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends t…
▽ More
Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends to be ill-conditioned, i.e., increasing the condition number of the weight matrix. This phenomenon aggravates the vulnerability of a DNN to input noise. Although a highly pruned weight matrix is considered to be able to lower the upper bound of the local Lipschitz constant to tolerate large distortion, the ill-conditionedness of such a weight matrix results in a non-robust DNN model. To overcome this challenge, this work develops novel joint constraints to adjust the weight distribution of networks, namely, the Transformed Sparse Constraint joint with Condition Number Constraint (TSCNC), which copes with smoothing distribution and differentiable constraint functions to reduce condition number and thus avoid the ill-conditionedness of weight matrices. Furthermore, our theoretical analyses unveil the relevance between the condition number and the local Lipschitz constant of the weight matrix, namely, the sharply increasing condition number becomes the dominant factor that restricts the robustness of over-sparsified models. Extensive experiments are conducted on several public datasets, and the results show that the proposed constraints significantly improve the robustness of a DNN with high pruning rates.
△ Less
Submitted 26 March, 2025;
originally announced March 2025.
-
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
Authors:
Zizhi Chen,
Minghao Han,
Xukun Zhang,
Shuwei Ma,
Tao Liu,
Xing Wei,
Lihua Zhang
Abstract:
Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answerin…
▽ More
Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is https://github.com/CZZZZZZZZZZZZZZZZZ/VGAT.
△ Less
Submitted 29 March, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
Authors:
Junyuan Gao,
Jiahe Song,
Jiang Wu,
Runchuan Zhu,
Guanlin Shen,
Shasha Wang,
Xingjian Wei,
Haote Yang,
Songyang Zhang,
Weijia Li,
Bin Wang,
Dahua Lin,
Lijun Wu,
Conghui He
Abstract:
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enab…
▽ More
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM\textsuperscript{4}Bench incorporates safety evaluations, addressing critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
Authors:
Chun Gu,
Xiaofei Wei,
Li Zhang,
Xiatian Zhu
Abstract:
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use pre-defined non-learnable importance…
▽ More
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use pre-defined non-learnable importance samplers in prior manually, struggling to effectively match the spatially and directionally varied integrand and resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and our normalizing flow. Extensive experiments validate the superiority of TensoFlow over prior alternatives on both synthetic and real-world benchmarks.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation
Authors:
Peng Chen,
Xiaobao Wei,
Ming Lu,
Hui Chen,
Feng Tian
Abstract:
Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the nondeterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. Existing diffusion-based methods can improve the diversity of facial anim…
▽ More
Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the nondeterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles conveying accurate lip language is still lacking, besides, efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4\% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Authors:
Qiying Yu,
Zheng Zhang,
Ruofei Zhu,
Yufeng Yuan,
Xiaochen Zuo,
Yu Yue,
Tiantian Fan,
Gaohong Liu,
Lingjun Liu,
Xin Liu,
Haibin Lin,
Zhiqi Lin,
Bole Ma,
Guangming Sheng,
Yuxuan Tong,
Chi Zhang,
Mofan Zhang,
Wang Zhang,
Hang Zhu,
Jinhua Zhu,
Jiaze Chen,
Jiangjie Chen,
Chengyi Wang,
Hongli Yu,
Weinan Dai
, et al. (10 additional authors not shown)
Abstract:
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecouple…
▽ More
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
ViSpeak: Visual Instruction Feedback in Streaming Videos
Authors:
Shenghao Fu,
Qize Yang,
Yuan-Ming Li,
Yi-Xing Peng,
Kun-Yu Lin,
Xihan Wei,
Jian-Fang Hu,
Xiaohua Xie,
Wei-Shi Zheng
Abstract:
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedbac…
▽ More
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
Authors:
Tianyi Zhao,
Boyang Liu,
Yanglei Gao,
Yiming Sun,
Maoxun Yuan,
Xingxing Wei
Abstract:
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem that the decreased feature e…
▽ More
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem that the decreased feature extraction capability in multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon--Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct an novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture
Authors:
Xiaokang Wei,
Bowen Zhang,
Xianghui Yang,
Yuxuan Wang,
Chunchao Guo,
Xi Zhao,
Yan Luximon
Abstract:
Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially…
▽ More
Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality mesh with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualization can be found on our project page: https://pbr3dgen1218.github.io/.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
DNA Origami Nanostructures Observed in Transmission Electron Microscopy Images can be Characterized through Convolutional Neural Networks
Authors:
Xingfei Wei,
Qiankun Mo,
Chi Chen,
Mark Bathe,
Rigoberto Hernandez
Abstract:
Artificial intelligence (AI) models remain an emerging strategy to accelerate materials design and development. We demonstrate that convolutional neural network (CNN) models can characterize DNA origami nanostructures employed in programmable self-assembling, which is important in many applications such as in biomedicine. Specifically, we benchmark the performance of 9 CNN models -- viz. AlexNet,…
▽ More
Artificial intelligence (AI) models remain an emerging strategy to accelerate materials design and development. We demonstrate that convolutional neural network (CNN) models can characterize DNA origami nanostructures employed in programmable self-assembling, which is important in many applications such as in biomedicine. Specifically, we benchmark the performance of 9 CNN models -- viz. AlexNet, GoogLeNet, VGG16, VGG19, ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152 -- to characterize the ligation number of DNA origami nanostructures in transmission electron microscopy (TEM) images. We first pre-train CNN models using a large image dataset of 720 images from our coarse-grained (CG) molecular dynamics (MD) simulations. Then, we fine-tune the pre-trained CNN models, using a small experimental TEM dataset with 146 TEM images. All CNN models were found to have similar computational time requirements, while their model sizes and performances are different. We use 20 test MD images to demonstrate that among all of the pre-trained CNN models ResNet50 and VGG16 have the highest and second highest accuracies. Among the fine-tuned models, VGG16 was found to have the highest agreement on the test TEM images. Thus, we conclude that fine-tuned VGG16 models can quickly characterize the ligation number of nanostructures in large TEM images.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection
Authors:
Shenghao Fu,
Junkai Yan,
Qize Yang,
Xihan Wei,
Xiaohua Xie,
Wei-Shi Zheng
Abstract:
Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, eg, CLIP, to inherit its generalizable recognition ability so that detectors can recognize new or novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectiv…
▽ More
Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, eg, CLIP, to inherit its generalizable recognition ability so that detectors can recognize new or novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectively. In this work, we propose a hierarchical semantic distillation framework named HD-OVD to construct a comprehensive distillation process, which exploits generalizable knowledge from the CLIP model in three aspects. In the first hierarchy of HD-OVD, the detector learns fine-grained instance-wise semantics from the CLIP image encoder by modeling relations among single objects in the visual space. Besides, we introduce text space novel-class-aware classification to help the detector assimilate the highly generalizable class-wise semantics from the CLIP text encoder, representing the second hierarchy. Lastly, abundant image-wise semantics containing multi-object and their contexts are also distilled by an image-wise contrastive distillation. Benefiting from the elaborated semantic distillation in triple hierarchies, our HD-OVD inherits generalizable recognition ability from CLIP in instance, class, and image levels. Thus, we boost the novel AP on the OV-COCO dataset to 46.4% with a ResNet50 backbone, which outperforms others by a clear margin. We also conduct extensive ablation studies to analyze how each component works.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.