-
End-to-End Vision Tokenizer Tuning
Authors:
Wenxuan Wang,
Fan Zhang,
Yufeng Cui,
Haiwen Diao,
Zhuoyan Luo,
Huchuan Lu,
Jing Liu,
Xinlong Wang
Abstract:
Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm…
▽ More
Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Authors:
Bin-Bin Gao,
Yue Zhu,
Jiangtao Yan,
Yuezhi Cai,
Weixi Zhang,
Meng Wang,
Jun Liu,
Yong Liu,
Lei Wang,
Chengjie Wang
Abstract:
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token…
▽ More
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and possesses a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Unsupervised Radar Point Cloud Enhancement via Arbitrary LiDAR Guided Diffusion Prior
Authors:
Yanlong Yang,
Jianan Liu,
Guanxiong Luo,
Hao Li,
Euijoon Ahn,
Mostafa Rahimi Azghadi,
Tao Huang
Abstract:
In industrial automation, radar is a critical sensor in machine perception. However, the angular resolution of radar is inherently limited by the Rayleigh criterion, which depends on both the radar's operating wavelength and the effective aperture of its antenna array.To overcome these hardware-imposed limitations, recent neural network-based methods have leveraged high-resolution LiDAR data, pair…
▽ More
In industrial automation, radar is a critical sensor in machine perception. However, the angular resolution of radar is inherently limited by the Rayleigh criterion, which depends on both the radar's operating wavelength and the effective aperture of its antenna array.To overcome these hardware-imposed limitations, recent neural network-based methods have leveraged high-resolution LiDAR data, paired with radar measurements, during training to enhance radar point cloud resolution. While effective, these approaches require extensive paired datasets, which are costly to acquire and prone to calibration error. These challenges motivate the need for methods that can improve radar resolution without relying on paired high-resolution ground-truth data. Here, we introduce an unsupervised radar points enhancement algorithm that employs an arbitrary LiDAR-guided diffusion model as a prior without the need for paired training data. Specifically, our approach formulates radar angle estimation recovery as an inverse problem and incorporates prior knowledge through a diffusion model with arbitrary LiDAR domain knowledge. Experimental results demonstrate that our method attains high fidelity and low noise performance compared to traditional regularization techniques. Additionally, compared to paired training methods, it not only achieves comparable performance but also offers improved generalization capability. To our knowledge, this is the first approach that enhances radar points output by integrating prior knowledge via a diffusion model rather than relying on paired training data. Our code is available at https://github.com/yyxr75/RadarINV.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Large Language Models Are More Persuasive Than Incentivized Human Persuaders
Authors:
Philipp Schoenegger,
Francesco Salvi,
Jiacheng Liu,
Xiaoli Nan,
Ramit Debnath,
Barbara Fasolo,
Evelina Leivada,
Gabriel Recchia,
Fritz Günther,
Ali Zarifhonarvar,
Joe Kwon,
Zahoor Ul Islam,
Marco Dehnert,
Daryl Y. H. Lee,
Madeline G. Reinecke,
David G. Kamper,
Mert Kobaş,
Adam Sandford,
Jonas Kgomo,
Luke Hewitt,
Shreya Kapoor,
Kerem Oktar,
Eyup Engin Kucuk,
Bo Feng,
Cameron R. Jones
, et al. (15 additional authors not shown)
Abstract:
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward co…
▽ More
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era
Authors:
Xixuan Hao,
Yutian Jiang,
Xingchen Zou,
Jiabo Liu,
Yifang Yin,
Yuxuan Liang
Abstract:
Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM)…
▽ More
Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and providing a roadmap for further innovation in LI. The summary of the up-to-date paper list can be found in https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will undergo continuous updates.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis
Authors:
Jiarun Liu,
Hong-Yu Zhou,
Weijian Huang,
Hao Yang,
Dongning Song,
Tao Tan,
Yong Liang,
Shanshan Wang
Abstract:
Scaling up model and data size have demonstrated impressive performance improvement over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. It remains unclear the key factors in developing medical vision foundation models at scale due to the absence of an extensive understanding of scali…
▽ More
Scaling up model and data size have demonstrated impressive performance improvement over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. It remains unclear the key factors in developing medical vision foundation models at scale due to the absence of an extensive understanding of scaling behavior in the medical domain. In this paper, we explored the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities in developing scalable medical vision foundation models by self-supervised learning. To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset encompassing a wide range of biomedical image modalities and anatomies. We observed that scaling up does provide benefits but varies across tasks. Additional analysis reveals several factors correlated with scaling benefits. Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms the previous state-of-the-art foundation models across 12 medical benchmarks. Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
Authors:
Guan Gui,
Bin-Bin Gao,
Jun Liu,
Chengjie Wang,
Yunsheng Wu
Abstract:
Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we…
▽ More
Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we propose a few-shot Anomaly-driven Generation (AnoGen) method, which guides the diffusion model to generate realistic and diverse anomalies with only a few real anomalies, thereby benefiting training anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation model and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, \eg, DRAEM and DseTSeg achieved a 5.8\% and 1.5\% improvement in AU-PR metric on segmentation task, respectively. The code and generated anomalous data are available at https://github.com/gaobb/AnoGen.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Authors:
Yifu Yuan,
Haiqin Cui,
Yibin Chen,
Zibin Dong,
Fei Ni,
Longxin Kou,
Jinyi Liu,
Pengyi Li,
Yan Zheng,
Jianye Hao
Abstract:
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we…
▽ More
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 54.1% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
A Unified Model for Cardinality Estimation by Learning from Data and Queries via Sum-Product Networks
Authors:
Jiawei Liu,
Ju Fan,
Tongyu Liu,
Kai Zeng,
Jiannan Wang,
Quehuan Liu,
Tao Ye,
Nan Tang
Abstract:
Cardinality estimation is a fundamental component in database systems, crucial for generating efficient execution plans. Despite advancements in learning-based cardinality estimation, existing methods may struggle to simultaneously optimize the key criteria: estimation accuracy, inference time, and storage overhead, limiting their practical applicability in real-world database environments. This p…
▽ More
Cardinality estimation is a fundamental component in database systems, crucial for generating efficient execution plans. Despite advancements in learning-based cardinality estimation, existing methods may struggle to simultaneously optimize the key criteria: estimation accuracy, inference time, and storage overhead, limiting their practical applicability in real-world database environments. This paper introduces QSPN, a unified model that integrates both data distribution and query workload. QSPN achieves high estimation accuracy by modeling data distribution using the simple yet effective Sum-Product Network (SPN) structure. To ensure low inference time and reduce storage overhead, QSPN further partitions columns based on query access patterns. We formalize QSPN as a tree-based structure that extends SPNs by introducing two new node types: QProduct and QSplit. This paper studies the research challenges of developing efficient algorithms for the offline construction and online computation of QSPN. We conduct extensive experiments to evaluate QSPN in both single-table and multi-table cardinality estimation settings. The experimental results have demonstrated that QSPN achieves superior and robust performance on the three key criteria, compared with state-of-the-art approaches.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
On the Account Security Risks Posed by Password Strength Meters
Authors:
Ming Xu,
Weili Han,
Jitao Yu,
Jing Liu,
Xinyi Zhang,
Yun Lin,
Jin Song Dong
Abstract:
Password strength meters (PSMs) have been widely used by websites to gauge password strength, encouraging users to create stronger passwords. Popular data-driven PSMs, e.g., based on Markov, Probabilistic Context-free Grammar (PCFG) and neural networks, alarm strength based on a model learned from real passwords. Despite their proven effectiveness, the secure utility that arises from the leakage o…
▽ More
Password strength meters (PSMs) have been widely used by websites to gauge password strength, encouraging users to create stronger passwords. Popular data-driven PSMs, e.g., based on Markov, Probabilistic Context-free Grammar (PCFG) and neural networks, alarm strength based on a model learned from real passwords. Despite their proven effectiveness, the secure utility that arises from the leakage of trained passwords remains largely overlooked. To address this gap, we analyze 11 PSMs and find that 5 data-driven meters are vulnerable to membership inference attacks that expose their trained passwords, and seriously, 3 rule-based meters openly disclose their blocked passwords. We specifically design a PSM privacy leakage evaluation approach, and uncover that a series of general data-driven meters are vulnerable to leaking between 10^4 to 10^5 trained passwords, with the PCFG-based models being more vulnerable than other counterparts; furthermore, we aid in deriving insights that the inherent utility-privacy tradeoff is not as severe as previously thought. To further exploit the risks, we develop novel meter-aware attacks when a clever attacker can filter the used passwords during compromising accounts on websites using the meter, and experimentally show that attackers targeting websites that deployed the popular Zxcvbn meter can compromise an additional 5.84% user accounts within 10 attempts, demonstrating the urgent need for privacy-preserving PSMs that protect the confidentiality of the meter's used passwords. Finally, we sketch some counter-measures to mitigate these threats.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
Authors:
Tony Tao,
Mohan Kumar Srirama,
Jason Jingzhou Liu,
Kenneth Shaw,
Deepak Pathak
Abstract:
Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data?…
▽ More
Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments-nearly four times higher than policies trained with robot data only-and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at https://dexwild.github.io
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
Authors:
Weiyu Li,
Xuanyang Zhang,
Zheng Sun,
Di Qi,
Hao Li,
Wei Cheng,
Weiwei Cai,
Shihao Wu,
Jiarui Liu,
Zihao Wang,
Xiao Chen,
Feipeng Tian,
Jianxiong Pan,
Zeming Li,
Gang Yu,
Xiangyu Zhang,
Daxin Jiang,
Ping Tan
Abstract:
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline…
▽ More
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
ABase: the Multi-Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large-scale Cloud Environments
Authors:
Rong Kang,
Yanbin Chen,
Ye Liu,
Fuxin Jiang,
Qingshuo Li,
Miao Ma,
Jian Liu,
Guangliang Zhao,
Tieying Zhang,
Jianjun Chen,
Lei Zhang
Abstract:
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the im…
▽ More
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the impact of caching on performance isolation, as cache hits alter request execution and resource consumption, leading to inaccurate traffic control; (2) the dynamic changes in traffic, with changes in tenant traffic trends causing throttling or resource wastage, and changes in access distribution causing hot key pressure or cache hit ratio drops; and (3) the imbalanced layout of data nodes due to tenants' diverse resource requirements, leading to low resource utilization. To address these challenges, we introduce ABase, a multi-tenant NoSQL serverless database developed at ByteDance. ABase introduces a two-layer caching mechanism with a cache-aware isolation mechanism to ensure accurate resource consumption estimates. Furthermore, ABase employs a predictive autoscaling policy to dynamically adjust resources in response to tenant traffic changes and a multi-resource rescheduling algorithm to balance resource utilization across data nodes. With these innovations, ABase has successfully served ByteDance's large-scale cloud environment, supporting a total workload that has achieved a peak QPS of over 13 billion and total storage exceeding 1 EB.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Neural Brain: A Neuroscience-inspired Framework for Embodied Agents
Authors:
Jian Liu,
Xiongtao Shi,
Thai Duy Nguyen,
Haitian Zhang,
Tianxiang Zhang,
Wei Sun,
Yanjie Li,
Athanasios V. Vasilakos,
Giovanni Iacca,
Arshad Ali Khan,
Arvind Kumar,
Jae Won Cho,
Ajmal Mian,
Lihua Xie,
Erik Cambria,
Lin Wang
Abstract:
The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the ris…
▽ More
The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the rise of embodied AI, where autonomous agents, such as humanoid robots, must navigate and manipulate unstructured environments with human-like adaptability. At the core of this challenge lies the concept of Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability. A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities. Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments. This paper introduces a unified framework for the Neural Brain of embodied agents, addressing two fundamental challenges: (1) defining the core components of Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment. To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization. Furthermore, we also review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence. By synthesizing insights from neuroscience, we outline a roadmap towards the development of generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward
Authors:
Yi Zhang,
Ruihong Qiu,
Xuwei Xu,
Jiajun Liu,
Sen Wang
Abstract:
Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) determ…
▽ More
Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a \textbf{\textit{selector}} is introduced to identify reference users by balancing similarity and diversity so that the \textbf{\textit{recommender}} can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at https://github.com/ArronDZhang/DARLR.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
Authors:
Chengkai Xu,
Jiaqi Liu,
Yicheng Guo,
Yuhang Zhang,
Peng Hang,
Jian Sun
Abstract:
Autonomous driving has made significant strides through data-driven techniques, achieving robust performance in standardized tasks. However, existing methods frequently overlook user-specific preferences, offering limited scope for interaction and adaptation with users. To address these challenges, we propose a "fast-slow" decision-making framework that integrates a Large Language Model (LLM) for…
▽ More
Autonomous driving has made significant strides through data-driven techniques, achieving robust performance in standardized tasks. However, existing methods frequently overlook user-specific preferences, offering limited scope for interaction and adaptation with users. To address these challenges, we propose a "fast-slow" decision-making framework that integrates a Large Language Model (LLM) for high-level instruction parsing with a Reinforcement Learning (RL) agent for low-level real-time decision. In this dual system, the LLM operates as the "slow" module, translating user directives into structured guidance, while the RL agent functions as the "fast" module, making time-critical maneuvers under stringent latency constraints. By decoupling high-level decision making from rapid control, our framework enables personalized user-centric operation while maintaining robust safety margins. Experimental evaluations across various driving scenarios demonstrate the effectiveness of our method. Compared to baseline algorithms, the proposed architecture not only reduces collision rates but also aligns driving behaviors more closely with user preferences, thereby achieving a human-centric mode. By integrating user guidance at the decision level and refining it with real-time control, our framework bridges the gap between individual passenger needs and the rigor required for safe, reliable driving in complex traffic environments.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking
Authors:
Fabrice Y Harel-Canada,
Boran Erol,
Connor Choi,
Jason Liu,
Gary Jiarui Song,
Nanyun Peng,
Amit Sahai
Abstract:
Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide…
▽ More
Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide edits). Through large-scale experiments and human-validated assessments, we find mixing is slow: 100% of perturbed texts retain traces of their origin after hundreds of edits, defying rapid mixing. Oracles falter, as state-of-the-art quality detectors misjudge edits (77% accuracy), compounding errors during attacks. Ultimately, attacks underperform: automated walks remove watermarks just 26% of the time -- dropping to 10% under human quality review. These findings challenge the inevitability of watermark removal. Instead, practical barriers -- slow mixing and imperfect quality control -- reveal watermarking to be far more robust than theoretical models suggest. The gap between idealized attacks and real-world feasibility underscores the need for stronger watermarking methods and more realistic attack models.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes
Authors:
Shalin Anand Jain,
Jiazhen Liu,
Siva Kailas,
Harish Ravichandar
Abstract:
Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluat…
▽ More
Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot reinforcement learning (MRRL) policies with realistic robot dynamics and safety constraints, supporting both parallelization and hardware acceleration. Our generalizable learning interface provides an easy-to-use integration with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a realistic robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Jailbreaking the Text-to-Video Generative Models
Authors:
Jiayang Liu,
Siyuan Liang,
Shiqian Zhao,
Rongcheng Tu,
Wenbo Zhou,
Xiaochun Cao,
Dacheng Tao,
Siew Kei Lam
Abstract:
Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attack, i.e. to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing effo…
▽ More
Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attack, i.e. to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the \textit{first} optimization-based jailbreak attack against text-to-video models, which is specifically designed. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Exploring Multimodal Foundation AI and Expert-in-the-Loop for Sustainable Management of Wild Salmon Fisheries in Indigenous Rivers
Authors:
Chi Xu,
Yili Jin,
Sami Ma,
Rongsheng Qian,
Hao Fang,
Jiangchuan Liu,
Xue Liu,
Edith C. H. Ngai,
William I. Atlas,
Katrina M. Connors,
Mark A. Spoljaric
Abstract:
Wild salmon are essential to the ecological, economic, and cultural sustainability of the North Pacific Rim. Yet climate variability, habitat loss, and data limitations in remote ecosystems that lack basic infrastructure support pose significant challenges to effective fisheries management. This project explores the integration of multimodal foundation AI and expert-in-the-loop frameworks to enhan…
▽ More
Wild salmon are essential to the ecological, economic, and cultural sustainability of the North Pacific Rim. Yet climate variability, habitat loss, and data limitations in remote ecosystems that lack basic infrastructure support pose significant challenges to effective fisheries management. This project explores the integration of multimodal foundation AI and expert-in-the-loop frameworks to enhance wild salmon monitoring and sustainable fisheries management in Indigenous rivers across Pacific Northwest. By leveraging video and sonar-based monitoring, we develop AI-powered tools for automated species identification, counting, and length measurement, reducing manual effort, expediting delivery of results, and improving decision-making accuracy. Expert validation and active learning frameworks ensure ecological relevance while reducing annotation burdens. To address unique technical and societal challenges, we bring together a cross-domain, interdisciplinary team of university researchers, fisheries biologists, Indigenous stewardship practitioners, government agencies, and conservation organizations. Through these collaborations, our research fosters ethical AI co-development, open data sharing, and culturally informed fisheries management.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
Authors:
Jae-Won Chung,
Jiachen Liu,
Jeff J. Ma,
Ruofan Wu,
Oh Jun Kweon,
Yuxuan Xia,
Zhiyu Wu,
Mosharaf Chowdhury
Abstract:
As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environ…
▽ More
As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
A novel Neural-ODE model for the state of health estimation of lithium-ion battery using charging curve
Authors:
Yiming Li,
Man He,
Jiapeng Liu
Abstract:
The state of health (SOH) of lithium-ion batteries (LIBs) is crucial for ensuring the safe and reliable operation of electric vehicles. Nevertheless, the prevailing SOH estimation methods often have limited generalizability. This paper introduces a data-driven approach for estimating the SOH of LIBs, which is designed to improve generalization. We construct a hybrid model named ACLA, which integra…
▽ More
The state of health (SOH) of lithium-ion batteries (LIBs) is crucial for ensuring the safe and reliable operation of electric vehicles. Nevertheless, the prevailing SOH estimation methods often have limited generalizability. This paper introduces a data-driven approach for estimating the SOH of LIBs, which is designed to improve generalization. We construct a hybrid model named ACLA, which integrates the attention mechanism, convolutional neural network (CNN), and long short-term memory network (LSTM) into the augmented neural ordinary differential equation (ANODE) framework. This model employs normalized charging time corresponding to specific voltages in the constant current charging phase as input and outputs the SOH as well as remaining useful of life. The model is trained on NASA and Oxford datasets and validated on the TJU and HUST datasets. Compared to the benchmark models NODE and ANODE, ACLA exhibits higher accuracy with root mean square errors (RMSE) for SOH estimation as low as 1.01% and 2.24% on the TJU and HUST datasets, respectively.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Authors:
Jie Liu,
Gongye Liu,
Jiajun Liang,
Yangguang Li,
Jiaheng Liu,
Xintao Wang,
Pengfei Wan,
Di Zhang,
Wanli Ouyang
Abstract:
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistica…
▽ More
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
△ Less
Submitted 11 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
GroverGPT-2: Simulating Grover's Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization
Authors:
Min Chen,
Jinglei Cheng,
Pingzhi Li,
Haoran Wang,
Tianlong Chen,
Junyu Liu
Abstract:
Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, pr…
▽ More
Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, prompting exploration into their potential for this challenge. In this work, we introduce GroverGPT-2, an LLM-based method for simulating Grover's algorithm using Chain-of-Thought (CoT) reasoning and quantum-native tokenization. Building on its predecessor, GroverGPT-2 performs simulation directly from quantum circuit representations while producing logically structured and interpretable outputs. Our results show that GroverGPT-2 can learn and internalize quantum circuit logic through efficient processing of quantum-native tokens, providing direct evidence that classical models like LLMs can capture the structure of quantum algorithms. Furthermore, GroverGPT-2 outputs interleave circuit data with natural language, embedding explicit reasoning into the simulation. This dual capability positions GroverGPT-2 as a prototype for advancing machine understanding of quantum algorithms and modeling quantum circuit logic. We also identify an empirical scaling law for GroverGPT-2 with increasing qubit numbers, suggesting a path toward scalable classical simulation. These findings open new directions for exploring the limits of classical simulatability, enhancing quantum education and research, and laying groundwork for future foundation models in quantum computing.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Active Sampling for MRI-based Sequential Decision Making
Authors:
Yuning Du,
Jingshuai Liu,
Rohan Dharmakumar,
Sotirios A. Tsaftaris
Abstract:
Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples…
▽ More
Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale
Authors:
Jiale Liu,
Yifan Zeng,
Shaokun Zhang,
Chi Zhang,
Malte Højmark-Bertelsen,
Marie Normann Gadeberg,
Huazheng Wang,
Qingyun Wu
Abstract:
LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a…
▽ More
LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT
Authors:
Chuyu Zhao,
Hao Huang,
Jiashuo Guo,
Ziyu Shen,
Zhongwei Zhou,
Jie Liu,
Zekuan Yu
Abstract:
Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these…
▽ More
Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
Authors:
Junqi Liu,
Xiaohan Lin,
Jonas Bayer,
Yael Dillies,
Weijie Jiang,
Xiaodan Liang,
Roman Soletskyi,
Haiming Wang,
Yunzhou Xie,
Beibei Xiong,
Zhengfeng Yang,
Jujian Zhang,
Lihong Zhi,
Jia Li,
Zhengying Liu
Abstract:
Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive be…
▽ More
Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean~4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and span over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3 as its statement contain an images). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for $\textbf{F}$ill-in-the-blank $\textbf{in}$ L$\textbf{e}$an Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both ``with solution'' and ``without solution'' scenarios. We open source the benchmark dataset alongside with the code of the proposed evaluation method at https://github.com/MoonshotAI/CombiBench/.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results
Authors:
Nikolay Safonov,
Alexey Bryncev,
Andrey Moskalenko,
Dmitry Kulikov,
Dmitry Vatolin,
Radu Timofte,
Haibo Lei,
Qifan Gao,
Qing Luo,
Yaqing Li,
Jie Song,
Shaozhe Hao,
Meisong Zheng,
Jingyi Xu,
Chengbin Wu,
Jiahui Liu,
Ying Chen,
Xin Deng,
Mai Xu,
Peipei Liang,
Jie Ma,
Junjie Jin,
Yingxue Pang,
Fangzhou Luo,
Kai Chen
, et al. (6 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such vid…
▽ More
This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Authors:
Yaoqi Chen,
Jinkai Zhang,
Baotong Lu,
Qianxi Zhang,
Chengruidong Zhang,
Jingjia Luo,
Di Liu,
Huiqiang Jiang,
Qi Chen,
Jing Liu,
Bailu Ding,
Xiao Yan,
Jiawei Jiang,
Chen Chen,
Mingxing Zhang,
Yuqing Yang,
Fan Yang,
Mao Yang
Abstract:
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index,…
▽ More
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Understanding University Students' Use of Generative AI: The Roles of Demographics and Personality Traits
Authors:
Newnew Deng,
Edward Jiusi Liu,
Xiaoming Zhai
Abstract:
The use of generative AI (GAI) among university students is rapidly increasing, yet empirical research on students' GAI use and the factors influencing it remains limited. To address this gap, we surveyed 363 undergraduate and graduate students in the United States, examining their GAI usage and how it relates to demographic variables and personality traits based on the Big Five model (i.e., extra…
▽ More
The use of generative AI (GAI) among university students is rapidly increasing, yet empirical research on students' GAI use and the factors influencing it remains limited. To address this gap, we surveyed 363 undergraduate and graduate students in the United States, examining their GAI usage and how it relates to demographic variables and personality traits based on the Big Five model (i.e., extraversion, agreeableness, conscientiousness, and emotional stability, and intellect/imagination). Our findings reveal: (a) Students in higher academic years are more inclined to use GAI and prefer it over traditional resources. (b) Non-native English speakers use and adopt GAI more readily than native speakers. (c) Compared to White, Asian students report higher GAI usage, perceive greater academic benefits, and express a stronger preference for it. Similarly, Black students report a more positive impact of GAI on their academic performance. Personality traits also play a significant role in shaping perceptions and usage of GAI. After controlling demographic factors, we found that personality still significantly predicts GAI use and attitudes: (a) Students with higher conscientiousness use GAI less. (b) Students who are higher in agreeableness perceive a less positive impact of GAI on academic performance and express more ethical concerns about using it for academic work. (c) Students with higher emotional stability report a more positive impact of GAI on learning and fewer concerns about its academic use. (d) Students with higher extraversion show a stronger preference for GAI over traditional resources. (e) Students with higher intellect/imagination tend to prefer traditional resources. These insights highlight the need for universities to provide personalized guidance to ensure students use GAI effectively, ethically, and equitably in their academic pursuits.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Machine-Learning-Powered Neural Interfaces for Smart Prosthetics and Diagnostics
Authors:
MohammadAli Shaeri,
Jinhan Liu,
Mahsa Shoaran
Abstract:
Advanced neural interfaces are transforming applications ranging from neuroscience research to diagnostic tools (for mental state recognition, tremor and seizure detection) as well as prosthetic devices (for motor and communication recovery). By integrating complex functions into miniaturized neural devices, these systems unlock significant opportunities for personalized assistive technologies and…
▽ More
Advanced neural interfaces are transforming applications ranging from neuroscience research to diagnostic tools (for mental state recognition, tremor and seizure detection) as well as prosthetic devices (for motor and communication recovery). By integrating complex functions into miniaturized neural devices, these systems unlock significant opportunities for personalized assistive technologies and adaptive therapeutic interventions. Leveraging high-density neural recordings, on-site signal processing, and machine learning (ML), these interfaces extract critical features, identify disease neuro-markers, and enable accurate, low-latency neural decoding. This integration facilitates real-time interpretation of neural signals, adaptive modulation of brain activity, and efficient control of assistive devices. Moreover, the synergy between neural interfaces and ML has paved the way for self-sufficient, ubiquitous platforms capable of operating in diverse environments with minimal hardware costs and external dependencies. In this work, we review recent advancements in AI-driven decoding algorithms and energy-efficient System-on-Chip (SoC) platforms for next-generation miniaturized neural devices. These innovations highlight the potential for developing intelligent neural interfaces, addressing critical challenges in scalability, reliability, interpretability, and user adaptability.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Authors:
Xiaoqi Li,
Lingyun Xu,
Mingxu Zhang,
Jiaming Liu,
Yan Shen,
Iaroslav Ponomarenko,
Jiahui Xu,
Liang Heng,
Siyuan Huang,
Shanghang Zhang,
Hao Dong
Abstract:
In robotic, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo that leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a…
▽ More
In robotic, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo that leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning
Authors:
Rui Lv,
Zaixi Zhang,
Kai Zhang,
Qi Liu,
Weibo Gao,
Jiawei Liu,
Jiaxia Yan,
Linan Yue,
Fangzhou Yao
Abstract:
Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and…
▽ More
Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the $k$-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at https://github.com/karin0018/GraphPrompter.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
ForgeEDA: A Comprehensive Multimodal Dataset for Advancing EDA
Authors:
Zhengyuan Shi,
Zeju Li,
Chengyu Ma,
Yunhao Zhou,
Ziyang Zheng,
Jiawei Liu,
Hongyang Pan,
Lingfeng Zhou,
Kezhi Li,
Jiaying Zhu,
Lingwei Yan,
Zhiqiang He,
Chenhao Xue,
Wentao Jiang,
Fan Yang,
Guangyu Sun,
Xiaoyan Yang,
Gang Chen,
Chuan Shi,
Zhufei Chu,
Jun Yang,
Qiang Xu
Abstract:
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA's utility by benchmarking state-of-the-art EDA algorithms on…
▽ More
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA's utility by benchmarking state-of-the-art EDA algorithms on critical tasks such as Power, Performance, and Area (PPA) optimization, highlighting its ability to expose performance gaps and drive advancements. Additionally, ForgeEDA's scale and diversity facilitate the training of AI models for EDA tasks, demonstrating its potential to improve model performance and generalization. By addressing limitations in existing datasets, ForgeEDA aims to catalyze breakthroughs in modern IC design and support the next generation of innovations in EDA.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Authors:
Jing Liu,
Yao Du,
Kun Yang,
Yan Wang,
Xiping Hu,
Zehua Wang,
Yang Liu,
Peng Sun,
Azzedine Boukerche,
Victor C. M. Leung
Abstract:
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed sys…
▽ More
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensive examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examines practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLMs deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment
Authors:
Xiaoqi Li,
Jiaming Liu,
Nuowei Han,
Liang Heng,
Yandong Guo,
Hao Dong,
Yang Liu
Abstract:
The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse po…
▽ More
The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes
Authors:
Jie Liu,
Pan Zhou,
Zehao Xiao,
Jiayi Shen,
Wenzhe Yin,
Jan-Jakob Sonke,
Efstratios Gavves
Abstract:
Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentation, and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this wor…
▽ More
Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentation, and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose NPISeg3D, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model's ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking
Authors:
Yize Jiang,
Xinze Li,
Yuanyuan Zhang,
Jin Han,
Youjun Xu,
Ayush Pandit,
Zaixi Zhang,
Mengdi Wang,
Mengyang Wang,
Chong Liu,
Guang Yang,
Yejin Choi,
Wu-Jun Li,
Tianfan Fu,
Fang Wu,
Junhong Liu
Abstract:
Recently, significant progress has been made in protein-ligand docking, especially in modern deep learning methods, and some benchmarks were proposed, e.g., PoseBench, Plinder. However, these benchmarks suffer from less practical evaluation setups (e.g., blind docking, self docking), or heavy framework that involves training, raising challenges to assess docking methods efficiently. To fill this g…
▽ More
Recently, significant progress has been made in protein-ligand docking, especially in modern deep learning methods, and some benchmarks were proposed, e.g., PoseBench, Plinder. However, these benchmarks suffer from less practical evaluation setups (e.g., blind docking, self docking), or heavy framework that involves training, raising challenges to assess docking methods efficiently. To fill this gap, we proposed PoseX, an open-source benchmark focusing on self-docking and cross-docking, to evaluate the algorithmic advances practically and comprehensively. Specifically, first, we curate a new evaluation dataset with 718 entries for self docking and 1,312 for cross docking; second, we incorporate 22 docking methods across three methodological categories, including (1) traditional physics-based methods (e.g., Schrödinger Glide), (2) AI docking methods (e.g., DiffDock), (3) AI co-folding methods (e.g., AlphaFold3); third, we design a relaxation method as post-processing to minimize conformation energy and refine binding pose; fourth, we released a leaderboard to rank submitted models in real time. We draw some key insights via extensive experiments: (1) AI-based approaches have already surpassed traditional physics-based approaches in overall docking accuracy (RMSD). The longstanding generalization issues that have plagued AI molecular docking have been significantly alleviated in the latest models. (2) The stereochemical deficiencies of AI-based approaches can be greatly alleviated with post-processing relaxation. Combining AI docking methods with the enhanced relaxation method achieves the best performance to date. (3) AI co-folding methods commonly face ligand chirality issues, which cannot be resolved by relaxation. The code, curated dataset and leaderboard are released at https://github.com/CataAI/PoseX.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
TSTMotion: Training-free Scene-aware Text-to-motion Generation
Authors:
Ziyan Guo,
Haoxuan Qu,
Hossein Rahmani,
Dewen Soh,
Ping Hu,
Qiuhong Ke,
Jun Liu
Abstract:
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in div…
▽ More
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a \textbf{T}raining-free \textbf{S}cene-aware \textbf{T}ext-to-\textbf{Motion} framework, dubbed as \textbf{TSTMotion}, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code in \href{https://tstmotion.github.io/}{Project Page}.
△ Less
Submitted 5 May, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly
Authors:
Ruiyuan Zhang,
Qi Wang,
Jiaxiang Liu,
Yu Zhang,
Yuchi Huo,
Chao Wu
Abstract:
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cos…
▽ More
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose first a zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address the overlap parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons to several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released on https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Authors:
Shaokun Zhang,
Ming Yin,
Jieyu Zhang,
Jiale Liu,
Zhiguang Han,
Jingyang Zhang,
Beibin Li,
Chi Wang,
Huazheng Wang,
Yiran Chen,
Qingyun Wu
Abstract:
Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extens…
▽ More
Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
Consistency-aware Fake Videos Detection on Short Video Platforms
Authors:
Junxi Wang,
Jize liu,
Na Zhang,
Yaxiong Wang
Abstract:
This paper focuses to detect the fake news on the short video platforms. While significant research efforts have been devoted to this task with notable progress in recent years, current detection accuracy remains suboptimal due to the rapid evolution of content manipulation and generation technologies. Existing approaches typically employ a cross-modal fusion strategy that directly combines raw vi…
▽ More
This paper focuses to detect the fake news on the short video platforms. While significant research efforts have been devoted to this task with notable progress in recent years, current detection accuracy remains suboptimal due to the rapid evolution of content manipulation and generation technologies. Existing approaches typically employ a cross-modal fusion strategy that directly combines raw video data with metadata inputs before applying a classification layer. However, our empirical observations reveal a critical oversight: manipulated content frequently exhibits inter-modal inconsistencies that could serve as valuable discriminative features, yet remain underutilized in contemporary detection frameworks. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions as discriminative cues. Our approach consists of two core modules: Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD). CMCL includes Pseudo-label Generation (PLG) and Cross-modal Consistency Diagnosis (CMCD). In PLG, a Multimodal Large Language Model is used to generate pseudo-labels for evaluating cross-modal semantic consistency. Then, CMCD extracts [CLS] tokens and computes cosine loss to quantify cross-modal inconsistencies. MMCD further integrates multimodal features through Multimodal Feature Fusion (MFF) and Probability Scores Fusion (PSF). MFF employs a co-attention mechanism to enhance semantic interactions across different modalities, while a Transformer is utilized for comprehensive feature fusion. Meanwhile, PSF further integrates the fake news probability scores obtained in the previous step. Extensive experiments on established benchmarks (FakeSV and FakeTT) demonstrate our model exhibits outstanding performance in Fake videos detection.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Robust Orthogonal NMF with Label Propagation for Image Clustering
Authors:
Jingjing Liu,
Nian Wu,
Xianchao Xiu,
Jianhua Zhang
Abstract:
Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called…
▽ More
Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called robust orthogonal nonnegative matrix factorization (RONMF). This method not only considers the graph Laplacian and label propagation as regularization terms but also introduces a more effective non-convex structure to measure the reconstruction error and imposes orthogonal constraints on the basis matrix to reduce the noise corruption, thereby achieving higher robustness. To solve RONMF, we develop an alternating direction method of multipliers (ADMM)-based optimization algorithm. In particular, all subproblems have closed-form solutions, which ensures its efficiency. Experimental evaluations on eight public image datasets demonstrate that the proposed RONMF outperforms state-of-the-art NMF methods across various standard metrics and shows excellent robustness. The code will be available at https://github.com/slinda-liu.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
A Unified QoS-Aware Multiplexing Framework for Next Generation Immersive Communication with Legacy Wireless Applications
Authors:
Jihong Li,
Shunqing Zhang,
Tao Yu,
Guangjin Pan,
Kaixuan Huang,
Xiaojing Chen,
Yanzan Sun,
Junyu Liu,
Jiandong Li,
Derrick Wing Kwan Ng
Abstract:
Immersive communication, including emerging augmented reality, virtual reality, and holographic telepresence, has been identified as a key service for enabling next-generation wireless applications. To align with legacy wireless applications, such as enhanced mobile broadband or ultra-reliable low-latency communication, network slicing has been widely adopted. However, attempting to statistically…
▽ More
Immersive communication, including emerging augmented reality, virtual reality, and holographic telepresence, has been identified as a key service for enabling next-generation wireless applications. To align with legacy wireless applications, such as enhanced mobile broadband or ultra-reliable low-latency communication, network slicing has been widely adopted. However, attempting to statistically isolate the above types of wireless applications through different network slices may lead to throughput degradation and increased queue backlog. To address these challenges, we establish a unified QoS-aware framework that supports immersive communication and legacy wireless applications simultaneously. Based on the Lyapunov drift theorem, we transform the original long-term throughput maximization problem into an equivalent short-term throughput maximization weighted by virtual queue length. Moreover, to cope with the challenges introduced by the interaction between large-timescale network slicing and short-timescale resource allocation, we propose an adaptive adversarial slicing (Ad2S) scheme for networks with invarying channel statistics. To track the network channel variations, we also propose a measurement extrapolation-Kalman filter (ME-KF)-based method and refine our scheme into Ad2S-non-stationary refinement (Ad2S-NR). Through extended numerical examples, we demonstrate that our proposed schemes achieve 3.86 Mbps throughput improvement and 63.96% latency reduction with 24.36% convergence time reduction. Within our framework, the trade-off between total throughput and user service experience can be achieved by tuning systematic parameters.
△ Less
Submitted 2 May, 2025; v1 submitted 30 April, 2025;
originally announced April 2025.
-
Kimina Lean Server: Technical Report
Authors:
Marco Dos Santos,
Haiming Wang,
Hugues de Saxcé,
Ran Wang,
Mantas Baksys,
Mert Unsal,
Junqi Liu,
Zhengying Liu,
Jia Li
Abstract:
We introduce the Kimina Lean Server, an open-source project that enables fast and scalable interaction with Lean 4 via a unified REST API, designed as a simple verifier for reinforcement learning pipelines. Built on top of the Lean FRO's LeanREPL, it combines server-side parallelization by managing multiple Lean REPL processes in parallel, with an LRU caching strategy that reuses Lean imports acro…
▽ More
We introduce the Kimina Lean Server, an open-source project that enables fast and scalable interaction with Lean 4 via a unified REST API, designed as a simple verifier for reinforcement learning pipelines. Built on top of the Lean FRO's LeanREPL, it combines server-side parallelization by managing multiple Lean REPL processes in parallel, with an LRU caching strategy that reuses Lean imports across multiple requests. These features help reduce initialization overhead and allow large-scale batch processing of Lean code. The client-side interface allows users to submit batches of proofs and receive Lean feedback, including extracted tactics and tactic states via infotree processing. These features enable a high-performance, scalable workflow for both interaction and extraction of proofs, tactics, and tactic states. We open source our implementation on GitHub.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
MemeBLIP2: A novel lightweight multimodal system to detect harmful memes
Authors:
Jiaqi Liu,
Ran Tong,
Aowei Shen,
Shuzheng Li,
Changlin Yang,
Lisha Xu
Abstract:
Memes often merge visuals with brief text to share humor or opinions, yet some memes contain harmful messages such as hate speech. In this paper, we introduces MemeBLIP2, a light weight multimodal system that detects harmful memes by combining image and text features effectively. We build on previous studies by adding modules that align image and text representations into a shared space and fuse t…
▽ More
Memes often merge visuals with brief text to share humor or opinions, yet some memes contain harmful messages such as hate speech. In this paper, we introduces MemeBLIP2, a light weight multimodal system that detects harmful memes by combining image and text features effectively. We build on previous studies by adding modules that align image and text representations into a shared space and fuse them for better classification. Using BLIP-2 as the core vision-language model, our system is evaluated on the PrideMM datasets. The results show that MemeBLIP2 can capture subtle cues in both modalities, even in cases with ironic or culturally specific content, thereby improving the detection of harmful material.
△ Less
Submitted 6 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code
Authors:
Jian Zuo,
Junzhe Liu,
Xianyong Wang,
Yicheng Liu,
Navya Goli,
Tong Xu,
Hao Zhang,
Umamaheswara Rao Tida,
Zhenge Jia,
Mengying Zhao
Abstract:
Recent advances have demonstrated the promising capabilities of large language models (LLMs) in generating register-transfer level (RTL) code, such as Verilog. However, existing LLM-based frameworks still face significant challenges in accurately handling the complexity of real-world RTL designs, particularly those that are large-scale and involve multi-level module instantiations. To address this…
▽ More
Recent advances have demonstrated the promising capabilities of large language models (LLMs) in generating register-transfer level (RTL) code, such as Verilog. However, existing LLM-based frameworks still face significant challenges in accurately handling the complexity of real-world RTL designs, particularly those that are large-scale and involve multi-level module instantiations. To address this issue, we present ComplexVCoder, an open-source LLM-driven framework that enhances both the generation quality and efficiency of complex Verilog code. Specifically, we introduce a two-stage generation mechanism, which leverages an intermediate representation to enable a more accurate and structured transition from natural language descriptions to intricate Verilog designs. In addition, we introduce a rule-based alignment method and a domain-specific retrieval-augmented generation (RAG) to further improve the correctness of the synthesized code by incorporating relevant design knowledge during generation. To evaluate our approach, we construct a comprehensive dataset comprising 55 complex Verilog designs derived from real-world implementations. We also release an open-source benchmark suite for systematically assessing the quality of auto-generated RTL code together with the ComplexVCoder framework. Experimental results show that ComplexVCoder outperforms SOTA frameworks such as CodeV and RTLCoder by 14.6% and 22.2%, respectively, in terms of function correctness on complex Verilog benchmarks. Furthermore, ComplexVcoder achieves comparable generation performances in terms of functionality correctness using a lightweight 32B model (Qwen2.5), rivaling larger-scale models such as GPT-3.5 and DeepSeek-V3.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
SAM-Guided Robust Representation Learning for One-Shot 3D Medical Image Segmentation
Authors:
Jia Wang,
Yunan Mei,
Jiarui Liu,
Xin Fan
Abstract:
One-shot medical image segmentation (MIS) is crucial for medical analysis due to the burden of medical experts on manual annotation. The recent emergence of the segment anything model (SAM) has demonstrated remarkable adaptation in MIS but cannot be directly applied to one-shot medical image segmentation (MIS) due to its reliance on labor-intensive user interactions and the high computational cost…
▽ More
One-shot medical image segmentation (MIS) is crucial for medical analysis due to the burden of medical experts on manual annotation. The recent emergence of the segment anything model (SAM) has demonstrated remarkable adaptation in MIS but cannot be directly applied to one-shot medical image segmentation (MIS) due to its reliance on labor-intensive user interactions and the high computational cost. To cope with these limitations, we propose a novel SAM-guided robust representation learning framework, named RRL-MedSAM, to adapt SAM to one-shot 3D MIS, which exploits the strong generalization capabilities of the SAM encoder to learn better feature representation. We devise a dual-stage knowledge distillation (DSKD) strategy to distill general knowledge between natural and medical images from the foundation model to train a lightweight encoder, and then adopt a mutual exponential moving average (mutual-EMA) to update the weights of the general lightweight encoder and medical-specific encoder. Specifically, pseudo labels from the registration network are used to perform mutual supervision for such two encoders. Moreover, we introduce an auto-prompting (AP) segmentation decoder which adopts the mask generated from the general lightweight model as a prompt to assist the medical-specific model in boosting the final segmentation performance. Extensive experiments conducted on three public datasets, i.e., OASIS, CT-lung demonstrate that the proposed RRL-MedSAM outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. Especially, our lightweight encoder uses only 3\% of the parameters compared to the encoder of SAM-Base.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.