-
Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot
Authors:
Hao Lu,
Jiaqi Tang,
Jiyao Wang,
Yunfan LU,
Xu Cao,
Qingyong Hu,
Yin Wang,
Yuting Zhang,
Tianxin Xie,
Yunpeng Zhang,
Yong Chen,
Jiayu. Gao,
Bin Huang,
Dengbo He,
Shuiguang Deng,
Hao Chen,
Ying-Cong Chen
Abstract:
The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can u…
▽ More
The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks
Authors:
Guangjin Pan,
Kaixuan Huang,
Hui Chen,
Shunqing Zhang,
Christian Häger,
Henk Wymeersch
Abstract:
Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propo…
▽ More
Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propose a foundation-model-based solution tailored for wireless localization. We first analyze how different self-supervised learning (SSL) tasks acquire general-purpose and task-specific semantic features based on information bottleneck (IB) theory. Building on this foundation, we design a pretraining methodology for the proposed Large Wireless Localization Model (LWLM). Specifically, we propose an SSL framework that jointly optimizes three complementary objectives: (i) spatial-frequency masked channel modeling (SF-MCM), (ii) domain-transformation invariance (DTI), and (iii) position-invariant contrastive learning (PICL). These objectives jointly capture the underlying semantics of wireless channel from multiple perspectives. We further design lightweight decoders for key downstream tasks, including time-of-arrival (ToA) estimation, angle-of-arrival (AoA) estimation, single base station (BS) localization, and multiple BS localization. Comprehensive experimental results confirm that LWLM consistently surpasses both model-based and supervised learning baselines across all localization tasks. In particular, LWLM achieves 26.0%--87.5% improvement over transformer models without pretraining, and exhibits strong generalization under label-limited fine-tuning and unseen BS configurations, confirming its potential as a foundation model for wireless localization.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Authors:
Hang Chen,
Jiaying Zhu,
Xinyu Yang,
Wenya Wang
Abstract:
Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from th…
▽ More
Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework is capable of fully identifying the logic gates and distinguishing them within the circuit. In addition to the extensive experimental validation of the framework's ability to restore the faithfulness, completeness, and sparsity of circuits, using this framework, we uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave among the functionalities of language models.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
SnapNCode: An Integrated Development Environment for Programming Physical Objects Interactions
Authors:
Xiaoyan Wei,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In…
▽ More
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In response to this challenge, we introduce SnapNCode, a novel IDE for spatial programming. SnapNCode enables programmers to capture various states of physical objects through live video streams from cameras and directly insert these visual representations into their code. Moreover, users can augment physical objects by attaching code snippets onto objects, which are opportunistically triggered when observed by cameras. We conducted a user study (N=12) to assess the usability of SnapNCode. Feedback from participants indicates that the system is easy-to-use and holds promise for daily casual uses and integration into a broader range of workflows.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Context-AI Tunes: Context-Aware AI-Generated Music for Stress Reduction
Authors:
Xiaoyan Wei,
Zebang Zhang,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that genera…
▽ More
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that generates personalized music based on environmental inputs and the user's self-assessed stress level. A 2x2 within-subject experiment (N=26) was conducted with two independent variables: AI (AI, NoAI) and Environment (Busy Hub, Quiet Library). CAT's effectiveness in reducing stress was evaluated using the Visual Analog Scale for Stress (VAS-S). Results show that CAT is more effective than manually chosen music in reducing stress by adapting to user context.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
On Alternating 6-Cycles in Edge-Coloured Graphs
Authors:
Hao Chen,
Jonathan A. Noel
Abstract:
In this short note, we use flag algebras to prove that the number of colour alternating 6-cycles in a red/blue colouring of a large clique is asymptotically maximized by a uniformly random colouring. This settles the first open case of a problem of Basit, Granet, Horsley, Kündgen and Staden.
In this short note, we use flag algebras to prove that the number of colour alternating 6-cycles in a red/blue colouring of a large clique is asymptotically maximized by a uniformly random colouring. This settles the first open case of a problem of Basit, Granet, Horsley, Kündgen and Staden.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Enabling Group Fairness in Graph Unlearning via Bi-level Debiasing
Authors:
Yezi Liu,
Prathyush Poduval,
Wenjun Huang,
Yang Ni,
Hanning Chen,
Mohsen Imani
Abstract:
Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across differe…
▽ More
Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across different sensitive groups often changes. Furthermore, graph models are shown to be prone to amplifying biases, making the study of fairness in graph unlearning particularly important. This raises the question: Does graph unlearning actually introduce bias? Our findings indicate that the predictions of post-unlearning models become highly correlated with sensitive attributes, confirming the introduction of bias in the graph unlearning process. To address this issue, we propose a fair graph unlearning method, FGU. To guarantee privacy, FGU trains shard models on partitioned subgraphs, unlearns the requested data from the corresponding subgraphs, and retrains the shard models on the modified subgraphs. To ensure fairness, FGU employs a bi-level debiasing process: it first enables shard-level fairness by incorporating a fairness regularizer in the shard model retraining, and then achieves global-level fairness by aligning all shard models to minimize global disparity. Our experiments demonstrate that FGU achieves superior fairness while maintaining privacy and accuracy. Additionally, FGU is robust to diverse unlearning requests, ensuring fairness and utility performance across various data distributions.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
A Survey of 3D Reconstruction with Event Cameras: From Event-based Geometry to Neural 3D Rendering
Authors:
Chuanzhi Xu,
Haoxian Zhou,
Langyi Chen,
Haodong Chen,
Ying Zhou,
Vera Chung,
Qiang Qu
Abstract:
Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low l…
▽ More
Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low light, or high dynamic range scenes. In this survey, we provide the first comprehensive review focused exclusively on 3D reconstruction using event cameras. The survey categorises existing works into three major types based on input modality - stereo, monocular, and multimodal systems, and further classifies them by reconstruction approach, including geometry-based, deep learning-based, and recent neural rendering techniques such as Neural Radiance Fields and 3D Gaussian Splatting. Methods with a similar research focus were organised chronologically into the most subdivided groups. We also summarise public datasets relevant to event-based 3D reconstruction. Finally, we highlight current research limitations in data availability, evaluation, representation, and dynamic scene handling, and outline promising future research directions. This survey aims to serve as a comprehensive reference and a roadmap for future developments in event-driven 3D reconstruction.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
A Practical Introduction to Deep Reinforcement Learning
Authors:
Yinghan Sun,
Hongxi Wang,
Hua Chen,
Wei Zhang
Abstract:
Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking t…
▽ More
Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking to enter the field. This tutorial aims to provide a concise, intuitive, and practical introduction to DRL, with a particular focus on the Proximal Policy Optimization (PPO) algorithm, which is one of the most widely used and effective DRL methods. To facilitate learning, we organize all algorithms under the Generalized Policy Iteration (GPI) framework, offering readers a unified and systematic perspective. Instead of lengthy theoretical proofs, we emphasize intuitive explanations, illustrative examples, and practical engineering techniques. This work serves as an efficient and accessible guide, helping readers rapidly progress from basic concepts to the implementation of advanced DRL algorithms.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Authors:
Gonglin Chen,
Tianwen Fu,
Haiwei Chen,
Wenbin Teng,
Hanyuan Xiao,
Yajie Zhao
Abstract:
As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Ro…
▽ More
As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark -- an evaluation setting that has recently gaining popularity for 3D reconstruction across different altitudes.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning
Authors:
Qi Xu,
Junyang Zhu,
Dongdong Zhou,
Hao Chen,
Yang Liu,
Jiangrong Shen,
Qiang Zhang
Abstract:
Deep neural networks (DNNs) excel in computer vision tasks, especially, few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive with scalability issues in real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynam…
▽ More
Deep neural networks (DNNs) excel in computer vision tasks, especially, few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive with scalability issues in real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves competitive performance to ANNs on static datasets such as CUB and miniImageNet with low power consumption.
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Sub-diffraction terahertz backpropagation compressive imaging
Authors:
Yongsheng Zhu,
Shaojing Liu,
Ximiao Wang,
Runli Li,
Haili Yang,
Jiali Wang,
Hongjia Zhu,
Yanlin Ke,
Ningsheng Xu,
Huanjun Chen,
Shaozhi Deng
Abstract:
Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backp…
▽ More
Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backpropagation compressive imaging technique. We illuminate the object with monochromatic continuous-wave THz radiation. The transmitted THz wave is modulated by prearranged patterns generated on the back surface of a 500-μm-thick silicon wafer, realized through photoexcited carriers using a 532-nm laser. The modulated THz wave is then recorded by a single-element detector. An untrained neural network is employed to iteratively reconstruct the object image with an ultralow compression ratio of 1.5625% under a physical model constraint, thus reducing the long sampling times. To further suppress the diffraction-field effects, embedded with the angular spectrum propagation (ASP) theory to model the diffraction of THz waves during propagation, the network retrieves near-field information from the object, enabling sub-diffraction imaging with a spatial resolution of ~λ0/7 (λ0 = 833.3 μm at 0.36 THz) and eliminating the need for ultrathin photomodulators. This approach provides an efficient solution for advancing THz microscopic imaging and addressing other inverse imaging challenges.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation
Authors:
Ching Han Chen,
Ming Fang Shiu
Abstract:
KAQG introduces a decisive breakthrough for Retrieval-Augmented Generation (RAG) by explicitly tackling the two chronic weaknesses of current pipelines: transparent multi-step reasoning and fine-grained cognitive difficulty control. This transforms RAG from a passive retriever into an accountable generator of calibrated exam items. Technically, the framework fuses knowledge graphs, RAG retrieval,…
▽ More
KAQG introduces a decisive breakthrough for Retrieval-Augmented Generation (RAG) by explicitly tackling the two chronic weaknesses of current pipelines: transparent multi-step reasoning and fine-grained cognitive difficulty control. This transforms RAG from a passive retriever into an accountable generator of calibrated exam items. Technically, the framework fuses knowledge graphs, RAG retrieval, and educational assessment theory into a single pipeline. Domain passages are parsed into a structured graph; graph-aware retrieval feeds fact chains to an LLM; and an assessment layer governed by Bloom's Taxonomy levels and Item Response Theory (IRT) transforms those chains into psychometrically sound questions. This cross-disciplinary marriage yields two scholarly contributions: it shows how semantic graph contexts guide LLM reasoning paths, and it operationalizes difficulty metrics within the generation process, producing items whose IRT parameters match expert benchmarks. Every module, from KG construction scripts to the multi-agent reasoning scheduler and the automatic IRT validator, is openly released on GitHub. This enables peer laboratories to replicate experiments, benchmark against baselines, and extend individual components without licensing barriers. Its reproducible design paves the way for rigorous ablation studies, cross-domain transfer experiments, and shared leaderboards on multi-step reasoning benchmarks.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
AgentFlow: Resilient Adaptive Cloud-Edge Framework for Multi-Agent Coordination
Authors:
Ching Han Chen,
Ming Fang Shiu
Abstract:
This paper presents AgentFlow, a MAS-based framework for programmable distributed systems in heterogeneous cloud-edge environments. It introduces logistics objects and abstract agent interfaces to enable dynamic service flows and modular orchestration. AgentFlow supports decentralized publish-subscribe messaging and many-to-many service elections, enabling decision coordination without a central s…
▽ More
This paper presents AgentFlow, a MAS-based framework for programmable distributed systems in heterogeneous cloud-edge environments. It introduces logistics objects and abstract agent interfaces to enable dynamic service flows and modular orchestration. AgentFlow supports decentralized publish-subscribe messaging and many-to-many service elections, enabling decision coordination without a central server. It features plug-and-play node discovery, flexible task reorganization, and highly adaptable fault tolerance and substitution mechanisms. AgentFlow advances scalable, real-time coordination for resilient and autonomous mission-critical systems.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Minimal Linear Codes Violating the Ashikhmin-Barg Condition from Arbitrary Projective Linear Codes
Authors:
Hao Chen,
Yaqi Chen,
Conghui Xie,
Huimin Lao
Abstract:
In recent years, there have been many constructions of minimal linear codes violating the Ashikhmin-Barg condition from Boolean functions, linear codes with few nonzero weights or partial difference sets. In this paper, we first give a general method to transform a minimal code satisfying the Ashikhmin-Barg condition to a minimal code violating the Ashikhmin-Barg condition. Then we give a construc…
▽ More
In recent years, there have been many constructions of minimal linear codes violating the Ashikhmin-Barg condition from Boolean functions, linear codes with few nonzero weights or partial difference sets. In this paper, we first give a general method to transform a minimal code satisfying the Ashikhmin-Barg condition to a minimal code violating the Ashikhmin-Barg condition. Then we give a construction of a minimal code satisfying the Ashikhmin-Barg condition from an arbitrary projective linear code. Hence an arbitrary projective linear code can be transformed to a minimal codes violating the Ashikhmin-Barg condition. Then we give infinite many families of minimal codes violating the Ashikhamin-Barg condition. Weight distributions of constructed minimal codes violating the Ashikhmin-Barg condition in this paper are determined. Many minimal linear codes violating the Ashikhmin-Barg condition with their minimum weights close to the optimal or the best known minimum weights of linear codes are constructed in this paper. Moreover, many infinite families of self-orthogonal binary minimal codes violating the Ashikhmin-Barg condition are also given.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Unsupervised Learning for Class Distribution Mismatch
Authors:
Pan Du,
Wangbo Zhao,
Xinai Lu,
Nian Liu,
Zhikai Li,
Chaoyu Gong,
Suyun Zhao,
Hong Chen,
Cuiping Li,
Kai Wang,
Yang You
Abstract:
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability a…
▽ More
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction
Authors:
Jian Song,
Hongruixuan Chen,
Naoto Yokoya
Abstract:
Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through d…
▽ More
Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through domain adaptation have shown remarkable performance in MHE, yet it remains unclear how these models make predictions or how reliable they truly are. In this paper, we investigate a state-of-the-art MHE model trained purely on synthetic data to explore where the model looks when making height predictions. Through systematic analyses, we find that the model relies heavily on shadow cues, a factor that can lead to overestimation or underestimation of heights when shadows deviate from expected norms. Furthermore, the inherent difficulty of evaluating regression tasks with the human eye underscores additional limitations of purely synthetic training. To address these issues, we propose a novel correction pipeline that integrates sparse, imperfect global LiDAR measurements (ICESat-2) with deep-learning outputs to improve local accuracy and achieve spatially consistent corrections. Our method comprises two stages: pre-processing raw ICESat-2 data, followed by a random forest-based approach to densely refine height estimates. Experiments in three representative urban regions -- Saint-Omer, Tokyo, and Sao Paulo -- reveal substantial error reductions, with mean absolute error (MAE) decreased by 22.8\%, 6.9\%, and 4.9\%, respectively. These findings highlight the critical role of shadow awareness in synthetic data-driven models and demonstrate how fusing imperfect real-world LiDAR data can bolster the robustness of MHE, paving the way for more reliable and scalable 3D mapping solutions.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Two-Stage Random Alternation Framework for Zero-Shot Pansharpening
Authors:
Haorui Chen,
Zeyu Ren,
Jiaxuan Ren,
Ran Ran,
Jinliang Shao,
Jie Huang,
Liangjian Deng
Abstract:
In recent years, pansharpening has seen rapid advancements with deep learning methods, which have demonstrated impressive fusion quality. However, the challenge of acquiring real high-resolution images limits the practical applicability of these methods. To address this, we propose a two-stage random alternating framework (TRA-PAN) that effectively integrates strong supervision constraints from re…
▽ More
In recent years, pansharpening has seen rapid advancements with deep learning methods, which have demonstrated impressive fusion quality. However, the challenge of acquiring real high-resolution images limits the practical applicability of these methods. To address this, we propose a two-stage random alternating framework (TRA-PAN) that effectively integrates strong supervision constraints from reduced-resolution images with the physical characteristics of full-resolution images. The first stage introduces a pre-training procedure, which includes Degradation-Aware Modeling (DAM) to capture spatial-spectral degradation mappings, alongside a warm-up procedure designed to reduce training time and mitigate the negative effects of reduced-resolution data. In the second stage, Random Alternation Optimization (RAO) is employed, where random alternating training leverages the strengths of both reduced- and full-resolution images, further optimizing the fusion model. By primarily relying on full-resolution images, our method enables zero-shot training with just a single image pair, obviating the need for large datasets. Experimental results demonstrate that TRA-PAN outperforms state-of-the-art (SOTA) methods in both quantitative metrics and visual quality in real-world scenarios, highlighting its strong practical applicability.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
Authors:
Xuzhi Zhang,
Shaohui Peng,
Qirui Zhou,
Yuanbo Wen,
Qi Guo,
Ruizhi Chen,
Xinguo Zhu,
Weiqiang Xiong,
Haixin Chen,
Congying Ma,
Ke Gao,
Chen Zhao,
Yanjun Wu,
Yunji Chen,
Ling Li
Abstract:
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks po…
▽ More
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability.LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291 \times$ performance improvement. Even compared with human experts, QiMeng-TensorOp could reach $251 \%$ of OpenBLAS on RISC-V CPUs, and $124 \%$ of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by $200 \times$ compared with human experts.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
Authors:
Baijiong Lin,
Weisen Jiang,
Yuancheng Xu,
Hao Chen,
Ying-Cong Chen
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors…
▽ More
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textit{multiple} ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation
Authors:
Dongying Li,
Binyi Su,
Hua Zhang,
Yong Li,
Haiyong Chen
Abstract:
Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To addr…
▽ More
Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To address these issues, we propose PDIG, a Photovoltaic Defect Image Generator based on Stable Diffusion (SD). PDIG leverages the strong priors learned from large-scale datasets to enhance generation quality under limited data. Specifically, we introduce a Semantic Concept Embedding (SCE) module that incorporates text-conditioned priors to capture the relational concepts between defect types and their appearances. To further enrich the domain distribution, we design a Lightweight Industrial Style Adaptor (LISA), which injects industrial defect characteristics into the SD model through cross-disentangled attention. At inference, we propose a Text-Image Dual-Space Constraints (TIDSC) module, enforcing the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments demonstrate that PDIG achieves superior realism and diversity compared to state-of-the-art methods. Specifically, our approach improves Frechet Inception Distance (FID) by 19.16 points over the second-best method and significantly enhances the performance of downstream defect detection tasks.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions
Authors:
Hongyi Chen,
Yunchao Yao,
Yufei Ye,
Zhixuan Xu,
Homanga Bharadhwaj,
Jiashun Wang,
Shubham Tulsiani,
Zackory Erickson,
Jeffrey Ichnowski
Abstract:
Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images s…
▽ More
Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images since they depict natural and functional object interactions, thereby bypassing the need for curated demonstrations. We reconstruct human hand-object interaction (HOI) 3D meshes from RGB images, retarget the human hand to multi-finger robot hands, and align the noisy object mesh with its accurate 3D shape. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model. To further expand the grasp dataset for seen and unseen objects, we use the initially-trained grasping policy with web data in the IsaacGym simulator to generate physically feasible grasps while preserving functionality. We train the grasping model on 10 object categories and evaluate it on 9 unseen objects, including challenging items such as syringes, pens, spray bottles, and tongs, which are underrepresented in existing datasets. The model trained on the web HOI dataset, achieving a 75.8% success rate on seen objects and 61.8% across all objects in simulation, with a 6.7% improvement in success rate and a 1.8x increase in functionality ratings over baselines. Simulator-augmented data further boosts performance from 61.8% to 83.4%. The sim-to-real transfer to the LEAP Hand achieves a 85% success rate. Project website is at: https://web2grasp.github.io/.
△ Less
Submitted 12 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
Authors:
Haizhen Xie,
Kunpeng Du,
Qiangyu Yan,
Sen Lu,
Jianhong Han,
Hanting Chen,
Hailin Hu,
Jie Hu
Abstract:
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM),…
▽ More
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $Ψ$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation
Authors:
Yi Lin,
Dong Zhang,
Xiao Fang,
Yufan Chen,
Kwang-Ting Cheng,
Hao Chen
Abstract:
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transfo…
▽ More
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
A Survey on Temporal Interaction Graph Representation Learning: Progress, Challenges, and Opportunities
Authors:
Pengfei Jiao,
Hongjiang Chen,
Xuan Guo,
Zhidong Zhao,
Dongxiao He,
Di Jin
Abstract:
Temporal interaction graphs (TIGs), defined by sequences of timestamped interaction events, have become ubiquitous in real-world applications due to their capability to model complex dynamic system behaviors. As a result, temporal interaction graph representation learning (TIGRL) has garnered significant attention in recent years. TIGRL aims to embed nodes in TIGs into low-dimensional representati…
▽ More
Temporal interaction graphs (TIGs), defined by sequences of timestamped interaction events, have become ubiquitous in real-world applications due to their capability to model complex dynamic system behaviors. As a result, temporal interaction graph representation learning (TIGRL) has garnered significant attention in recent years. TIGRL aims to embed nodes in TIGs into low-dimensional representations that effectively preserve both structural and temporal information, thereby enhancing the performance of downstream tasks such as classification, prediction, and clustering within constantly evolving data environments. In this paper, we begin by introducing the foundational concepts of TIGs and emphasize the critical role of temporal dependencies. We then propose a comprehensive taxonomy of state-of-the-art TIGRL methods, systematically categorizing them based on the types of information utilized during the learning process to address the unique challenges inherent to TIGs. To facilitate further research and practical applications, we curate the source of datasets and benchmarks, providing valuable resources for empirical investigations. Finally, we examine key open challenges and explore promising research directions in TIGRL, laying the groundwork for future advancements that have the potential to shape the evolution of this field.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Impact Analysis of Inference Time Attack of Perception Sensors on Autonomous Vehicles
Authors:
Hanlin Chen,
Simin Chen,
Wenyu Li,
Wei Yang,
Yiheng Feng
Abstract:
As a safety-critical cyber-physical system, cybersecurity and related safety issues for Autonomous Vehicles (AVs) have been important research topics for a while. Among all the modules on AVs, perception is one of the most accessible attack surfaces, as drivers and AVs have no control over the outside environment. Most current work targeting perception security for AVs focuses on perception correc…
▽ More
As a safety-critical cyber-physical system, cybersecurity and related safety issues for Autonomous Vehicles (AVs) have been important research topics for a while. Among all the modules on AVs, perception is one of the most accessible attack surfaces, as drivers and AVs have no control over the outside environment. Most current work targeting perception security for AVs focuses on perception correctness. In this work, we propose an impact analysis based on inference time attacks for autonomous vehicles. We demonstrate in a simulation system that such inference time attacks can also threaten the safety of both the ego vehicle and other traffic participants.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Implementation of Shor Algorithm: Factoring a 4096-Bit Integer Under Specific Constraints
Authors:
Abel C. H. Chen
Abstract:
In recent years, advancements in quantum chip technology, such as Willow, have contributed to reducing quantum computation error rates, potentially accelerating the practical adoption of quantum computing. As a result, the design of quantum algorithms suitable for real-world applications has become a crucial research direction. This study focuses on the implementation of Shor algorithm, aiming to…
▽ More
In recent years, advancements in quantum chip technology, such as Willow, have contributed to reducing quantum computation error rates, potentially accelerating the practical adoption of quantum computing. As a result, the design of quantum algorithms suitable for real-world applications has become a crucial research direction. This study focuses on the implementation of Shor algorithm, aiming to improve modular computation efficiency and demonstrate the factorization of a 4096-bit integer under specific constraints. Experimental results, when compared with state-of-the-art (SOTA) methods, indicate a significant improvement in efficiency while enabling the factorization of longer integers.
△ Less
Submitted 6 April, 2025;
originally announced May 2025.
-
Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
Authors:
Da Zheng,
Lun Du,
Junwei Su,
Yuchen Tian,
Yuqi Zhu,
Jintian Zhang,
Lanning Wei,
Ningyu Zhang,
Huajun Chen
Abstract:
Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate s…
▽ More
Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate solutions, make inferences, and even leverage external computational tools. However, applying LLMs to real-world problem-solving presents significant challenges, including multi-step reasoning, domain knowledge integration, and result verification. This survey explores the capabilities and limitations of LLMs in complex problem-solving, examining techniques including Chain-of-Thought (CoT) reasoning, knowledge augmentation, and various LLM-based and tool-based verification techniques. Additionally, we highlight domain-specific challenges in various domains, such as software engineering, mathematical reasoning and proving, data analysis and modeling, and scientific research. The paper further discusses the fundamental limitations of the current LLM solutions and the future directions of LLM-based complex problems solving from the perspective of multi-step reasoning, domain knowledge integration and result verification.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization
Authors:
Fei Han,
Pengming Guo,
Hao Chen,
Weikun Li,
Jingbo Ren,
Naijun Liu,
Ning Yang,
Dixia Fan
Abstract:
This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commo…
▽ More
This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimizations via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
DPNet: Dynamic Pooling Network for Tiny Object Detection
Authors:
Luqi Gong,
Haotian Chen,
Yikun Chen,
Tianliang Yao,
Chao Li,
Shuai Zhao,
Guangjie Han
Abstract:
In unmanned aerial systems, especially in complex environments, accurately detecting tiny objects is crucial. Resizing images is a common strategy to improve detection accuracy, particularly for small objects. However, simply enlarging images significantly increases computational costs and the number of negative samples, severely degrading detection performance and limiting its applicability. This…
▽ More
In unmanned aerial systems, especially in complex environments, accurately detecting tiny objects is crucial. Resizing images is a common strategy to improve detection accuracy, particularly for small objects. However, simply enlarging images significantly increases computational costs and the number of negative samples, severely degrading detection performance and limiting its applicability. This paper proposes a Dynamic Pooling Network (DPNet) for tiny object detection to mitigate these issues. DPNet employs a flexible down-sampling strategy by introducing a factor (df) to relax the fixed downsampling process of the feature map to an adjustable one. Furthermore, we design a lightweight predictor to predict df for each input image, which is used to decrease the resolution of feature maps in the backbone. Thus, we achieve input-aware downsampling. We also design an Adaptive Normalization Module (ANM) to make a unified detector compatible with different dfs. A guidance loss supervises the predictor's training. DPNet dynamically allocates computing resources to trade off between detection accuracy and efficiency. Experiments on the TinyCOCO and TinyPerson datasets show that DPNet can save over 35% and 25% GFLOPs, respectively, while maintaining comparable detection performance. The code will be made publicly available.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Dance of Fireworks: An Interactive Broadcast Gymnastics Training System Based on Pose Estimation
Authors:
Haotian Chen,
Ziyu Liu,
Xi Cheng,
Chuangqi Li
Abstract:
This study introduces Dance of Fireworks, an interactive system designed to combat sedentary health risks by enhancing engagement in radio calisthenics. Leveraging mobile device cameras and lightweight pose estimation (PoseNet/TensorFlow Lite), the system extracts body keypoints, computes joint angles, and compares them with standardized motions to deliver real-time corrective feedback. To incenti…
▽ More
This study introduces Dance of Fireworks, an interactive system designed to combat sedentary health risks by enhancing engagement in radio calisthenics. Leveraging mobile device cameras and lightweight pose estimation (PoseNet/TensorFlow Lite), the system extracts body keypoints, computes joint angles, and compares them with standardized motions to deliver real-time corrective feedback. To incentivize participation, it dynamically maps users' movements (such as joint angles and velocity) to customizable fireworks animations, rewarding improved accuracy with richer visual effects.
Experiments involving 136 participants demonstrated a significant reduction in average joint angle errors from 21.3 degrees to 9.8 degrees (p < 0.01) over four sessions, with 93.4 percent of users affirming its exercise-promoting efficacy and 85.4 percent praising its entertainment value. The system operates without predefined motion templates or specialised hardware, enabling seamless integration into office environments. Future enhancements will focus on improving pose recognition accuracy, reducing latency, and adding features such as multiplayer interaction and music synchronisation. This work presents a cost-effective, engaging solution to promote physical activity in sedentary populations.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
An Arbitrary-Modal Fusion Network for Volumetric Cranial Nerves Tract Segmentation
Authors:
Lei Xie,
Huajun Zhou,
Junxiong Huang,
Jiahao Huang,
Qingrun Zeng,
Jianzhong He,
Jiawei Zhang,
Baohua Fan,
Mingchu Li,
Guoqiang Xie,
Hao Chen,
Yuanjing Feng
Abstract:
The segmentation of cranial nerves (CNs) tract provides a valuable quantitative tool for the analysis of the morphology and trajectory of individual CNs. Multimodal CNs tract segmentation networks, e.g., CNTSeg, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI, have achieved promising segmentation performance. However, it is laborious or even infeasible to collect comple…
▽ More
The segmentation of cranial nerves (CNs) tract provides a valuable quantitative tool for the analysis of the morphology and trajectory of individual CNs. Multimodal CNs tract segmentation networks, e.g., CNTSeg, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI, have achieved promising segmentation performance. However, it is laborious or even infeasible to collect complete multimodal data in clinical practice due to limitations in equipment, user privacy, and working conditions. In this work, we propose a novel arbitrary-modal fusion network for volumetric CNs tract segmentation, called CNTSeg-v2, which trains one model to handle different combinations of available modalities. Instead of directly combining all the modalities, we select T1-weighted (T1w) images as the primary modality due to its simplicity in data acquisition and contribution most to the results, which supervises the information selection of other auxiliary modalities. Our model encompasses an Arbitrary-Modal Collaboration Module (ACM) designed to effectively extract informative features from other auxiliary modalities, guided by the supervision of T1w images. Meanwhile, we construct a Deep Distance-guided Multi-stage (DDM) decoder to correct small errors and discontinuities through signed distance maps to improve segmentation accuracy. We evaluate our CNTSeg-v2 on the Human Connectome Project (HCP) dataset and the clinical Multi-shell Diffusion MRI (MDM) dataset. Extensive experimental results show that our CNTSeg-v2 achieves state-of-the-art segmentation performance, outperforming all competing methods.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition
Authors:
Siyu Liang,
Yunan Li,
Wentian Xin,
Huizhou Chen,
Xujie Liu,
Kang Liu,
Qiguang Miao
Abstract:
Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method…
▽ More
Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlight its potential for developing inclusive communication technologies.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models
Authors:
Liqiang Jing,
Guiming Hardy Chen,
Ehsan Aghazadeh,
Xin Eric Wang,
Xinya Du
Abstract:
Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitiga…
▽ More
Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitigation of visual hallucinations, but the underlying causes have not been comprehensively investigated. In this paper, we analyze each component of LLaVA-like LVLMs -- the large language model, the vision backbone, and the projector -- to identify potential sources of error and their impact. Based on our observations, we propose methods to mitigate hallucination for each problematic component. Additionally, we developed two hallucination benchmarks: QA-VisualGenome, which emphasizes attribute and relation hallucinations, and QA-FB15k, which focuses on cognition-based hallucinations.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning
Authors:
Jifeng Hu,
Sili Huang,
Zhejian Yang,
Shengchao Hu,
Li Shen,
Hechang Chen,
Lichao Sun,
Yi Chang,
Dacheng Tao
Abstract:
Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address…
▽ More
Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of log-expectation formulation. We apply our method in 30+ offline RL tasks to demonstrate the effectiveness of our method. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding
Authors:
Siyang Jiang,
Bufang Yang,
Lilin Xu,
Mu Yuan,
Yeerzhati Abudunuer,
Kaiwei Liu,
Liekang Zeng,
Hongkai Chen,
Zhenyu Yan,
Xiaofan Jiang,
Guoliang Xing
Abstract:
The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well a…
▽ More
The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well as they are primarily designed for high-resolution data, such as RGB images. A quick fixing approach is to caption a large amount of low-resolution data, but it requires a significant amount of labor-intensive annotation efforts. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLM models for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels. Therefore, it can improve LLMs' understanding of sequential data and then generate high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems up to $40.03\%$ on average Bert-Score.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
SimAug: Enhancing Recommendation with Pretrained Language Models for Dense and Balanced Data Augmentation
Authors:
Yuying Zhao,
Xiaodong Yang,
Huiyuan Chen,
Xiran Fan,
Yu Wang,
Yiwei Cai,
Tyler Derr
Abstract:
Deep Neural Networks (DNNs) are extensively used in collaborative filtering due to their impressive effectiveness. These systems depend on interaction data to learn user and item embeddings that are crucial for recommendations. However, the data often suffers from sparsity and imbalance issues: limited observations of user-item interactions can result in sub-optimal performance, and a predominance…
▽ More
Deep Neural Networks (DNNs) are extensively used in collaborative filtering due to their impressive effectiveness. These systems depend on interaction data to learn user and item embeddings that are crucial for recommendations. However, the data often suffers from sparsity and imbalance issues: limited observations of user-item interactions can result in sub-optimal performance, and a predominance of interactions with popular items may introduce recommendation bias. To address these challenges, we employ Pretrained Language Models (PLMs) to enhance the interaction data with textual information, leading to a denser and more balanced dataset. Specifically, we propose a simple yet effective data augmentation method (SimAug) based on the textual similarity from PLMs, which can be seamlessly integrated to any systems as a lightweight, plug-and-play component in the pre-processing stage. Our experiments across nine datasets consistently demonstrate improvements in both utility and fairness when training with the augmented data generated by SimAug. The code is available at https://github.com/YuyingZhao/SimAug.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Machine Learning for Cyber-Attack Identification from Traffic Flows
Authors:
Yujing Zhou,
Marc L. Jacquet,
Robel Dawit,
Skyler Fabre,
Dev Sarawat,
Faheem Khan,
Madison Newell,
Yongxin Liu,
Dahai Liu,
Hongyun Chen,
Jian Wang,
Huihui Wang
Abstract:
This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL. using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research questions: are we able to identify cyber attacks by only analyzing traffic flow patterns. In…
▽ More
This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL. using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research questions: are we able to identify cyber attacks by only analyzing traffic flow patterns. In this research, the cyber attacks are focused particularly when lights are randomly turned all green or red at busy intersections by adversarial attackers. Despite challenges stemming from imbalanced data and overlapping traffic patterns, our best model shows 85\% accuracy when detecting intrusions purely using traffic flow statistics. Key indicators for successful detection included occupancy, jam length, and halting durations.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Explainable Machine Learning for Cyberattack Identification from Traffic Flows
Authors:
Yujing Zhou,
Marc L. Jacquet,
Robel Dawit,
Skyler Fabre,
Dev Sarawat,
Faheem Khan,
Madison Newell,
Yongxin Liu,
Dahai Liu,
Hongyun Chen,
Jian Wang,
Huihui Wang
Abstract:
The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using…
▽ More
The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using a virtualized traffic network to analyze disruption patterns. We develop a deep learning-based anomaly detection system, demonstrating that Longest Stop Duration and Total Jam Distance are key indicators of compromised signals. To enhance interpretability, we apply Explainable AI (XAI) techniques, identifying critical decision factors and diagnosing misclassification errors. Our analysis reveals two primary challenges: transitional data inconsistencies, where mislabeled recovery-phase traffic misleads the model, and model limitations, where stealth attacks in low-traffic conditions evade detection. This work enhances AI-driven traffic security, improving both detection accuracy and trustworthiness in smart transportation systems.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Early Detection of Patient Deterioration from Real-Time Wearable Monitoring System
Authors:
Lo Pang-Yun Ting,
Hong-Pei Chen,
An-Shan Liu,
Chun-Yin Yeh,
Po-Lin Chen,
Kun-Ta Chuang
Abstract:
Early detection of patient deterioration is crucial for reducing mortality rates. Heart rate data has shown promise in assessing patient health, and wearable devices offer a cost-effective solution for real-time monitoring. However, extracting meaningful insights from diverse heart rate data and handling missing values in wearable device data remain key challenges. To address these challenges, we…
▽ More
Early detection of patient deterioration is crucial for reducing mortality rates. Heart rate data has shown promise in assessing patient health, and wearable devices offer a cost-effective solution for real-time monitoring. However, extracting meaningful insights from diverse heart rate data and handling missing values in wearable device data remain key challenges. To address these challenges, we propose TARL, an innovative approach that models the structural relationships of representative subsequences, known as shapelets, in heart rate time series. TARL creates a shapelet-transition knowledge graph to model shapelet dynamics in heart rate time series, indicating illness progression and potential future changes. We further introduce a transition-aware knowledge embedding to reinforce relationships among shapelets and quantify the impact of missing values, enabling the formulation of comprehensive heart rate representations. These representations capture explanatory structures and predict future heart rate trends, aiding early illness detection. We collaborate with physicians and nurses to gather ICU patient heart rate data from wearables and diagnostic metrics assessing illness severity for evaluating deterioration. Experiments on real-world ICU data demonstrate that TARL achieves both high reliability and early detection. A case study further showcases TARL's explainable detection process, highlighting its potential as an AI-driven tool to assist clinicians in recognizing early signs of patient deterioration.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
PREMISE: Matching-based Prediction for Accurate Review Recommendation
Authors:
Wei Han,
Hui Chen,
Soujanya Poria
Abstract:
We present PREMISE (PREdict with Matching ScorEs), a new architecture for the matching-based learning in the multimodal fields for the multimodal review helpfulness (MRHP) task. Distinct to previous fusion-based methods which obtains multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semant…
▽ More
We present PREMISE (PREdict with Matching ScorEs), a new architecture for the matching-based learning in the multimodal fields for the multimodal review helpfulness (MRHP) task. Distinct to previous fusion-based methods which obtains multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semantics, and then obtained a set of matching scores as feature vectors for the downstream recommendation task. This new architecture significantly boosts the performance for such multimodal tasks whose context matching content are highly correlated to the targets of that task, compared to the state-of-the-art fusion-based methods. Experimental results on two publicly available datasets show that PREMISE achieves promising performance with less computational cost.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
Authors:
Haoyue Bai,
Yiyou Sun,
Wei Cheng,
Haifeng Chen
Abstract:
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection appr…
▽ More
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real world scenarios. In this work, we introduce a novel black box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt and recover strategy: by masking part of an image and assessing the model ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked image inputs, we incorporate a cost efficient surrogate model trained to align with the target model distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Attack and defense techniques in large language models: A survey and new perspectives
Authors:
Zhiyu Liao,
Kang Chen,
Yuanguo Lin,
Kangkang Li,
Yunxuan Liu,
Hefeng Chen,
Xingwang Huang,
Yuanhui Yu
Abstract:
Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack and defense techniques in LLMs. We classify attacks into adversarial prompt attack, optimized attacks, model theft, as well as attacks on application of LLMs, d…
▽ More
Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack and defense techniques in LLMs. We classify attacks into adversarial prompt attack, optimized attacks, model theft, as well as attacks on application of LLMs, detailing their mechanisms and implications. Consequently, we analyze defense strategies, including prevention-based and detection-based defense methods. Although advances have been made, challenges remain to adapt to the dynamic threat landscape, balance usability with robustness, and address resource constraints in defense implementation. We highlight open problems, including the need for adaptive scalable defenses, explainable security techniques, and standardized evaluation frameworks. This survey provides actionable insights and directions for developing secure and resilient LLMs, emphasizing the importance of interdisciplinary collaboration and ethical considerations to mitigate risks in real-world applications.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
A Survey of Interactive Generative Video
Authors:
Jiwen Yu,
Yiran Qin,
Haoxuan Che,
Quande Liu,
Xintao Wang,
Pengfei Wan,
Di Zhang,
Kun Gai,
Hao Chen,
Xihui Liu
Abstract:
Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedb…
▽ More
Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Rethinking Visual Layer Selection in Multimodal LLMs
Authors:
Haoran Chen,
Junyan Lin,
Xinhao Chen,
Yue Fan,
Xin Jin,
Hui Su,
Jianfeng Dong,
Jinlan Fu,
Xiaoyu Shen
Abstract:
Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more close…
▽ More
Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, most MLLMs still select visual features based on empirical heuristics rather than systematic analysis. In this work, we propose a Layer-wise Representation Similarity approach to group CLIP-ViT layers with similar behaviors into {shallow, middle, and deep} categories and assess their impact on MLLM performance. Building on this foundation, we revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. Through extensive experiments across 10 datasets and 4 tasks, we find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-layer selections, achieving gains on 9 out of 10 datasets. Our work offers the first principled study of visual layer selection in MLLMs, laying the groundwork for deeper investigations into visual representation learning for MLLMs.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
Authors:
Linshan Wu,
Yuxiang Nie,
Sunan He,
Jiaxin Zhuang,
Hao Chen
Abstract:
Multi-modal interpretation of biomedical images opens up novel opportunities in biomedical image analysis. Conventional AI approaches typically rely on disjointed training, i.e., Large Language Models (LLMs) for clinical text generation and segmentation models for target extraction, which results in inflexible real-world deployment and a failure to leverage holistic biomedical information. To this…
▽ More
Multi-modal interpretation of biomedical images opens up novel opportunities in biomedical image analysis. Conventional AI approaches typically rely on disjointed training, i.e., Large Language Models (LLMs) for clinical text generation and segmentation models for target extraction, which results in inflexible real-world deployment and a failure to leverage holistic biomedical information. To this end, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation. UniBiomed is based on a novel integration of Multi-modal Large Language Model (MLLM) and Segment Anything Model (SAM), which effectively unifies the generation of clinical texts and the segmentation of corresponding biomedical objects for grounded interpretation. In this way, UniBiomed is capable of tackling a wide range of biomedical tasks across ten diverse biomedical imaging modalities. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, annotations, and text descriptions across ten imaging modalities. Extensive validation on 84 internal and external datasets demonstrated that UniBiomed achieves state-of-the-art performance in segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation. Moreover, unlike previous models that rely on clinical experts to pre-diagnose images and manually craft precise textual or visual prompts, UniBiomed can provide automated and end-to-end grounded interpretation for biomedical image analysis. This represents a novel paradigm shift in clinical workflows, which will significantly improve diagnostic efficiency. In summary, UniBiomed represents a novel breakthrough in biomedical AI, unlocking powerful grounded interpretation capabilities for more accurate and efficient biomedical image analysis.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming
Authors:
Nanxu Gong,
Xinyuan Wang,
Wangyang Ying,
Haoyue Bai,
Sixun Dong,
Haifeng Chen,
Yanjie Fu
Abstract:
Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing met…
▽ More
Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic's advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
FFCBA: Feature-based Full-target Clean-label Backdoor Attacks
Authors:
Yangxu Yin,
Honglong Chen,
Yudong Gao,
Peng Sun,
Liantao Wu,
Zhe Li,
Weifeng Liu
Abstract:
Backdoor attacks pose a significant threat to deep neural networks, as backdoored models would misclassify poisoned samples with specific triggers into target classes while maintaining normal performance on clean samples. Among these, multi-target backdoor attacks can simultaneously target multiple classes. However, existing multi-target backdoor attacks all follow the dirty-label paradigm, where…
▽ More
Backdoor attacks pose a significant threat to deep neural networks, as backdoored models would misclassify poisoned samples with specific triggers into target classes while maintaining normal performance on clean samples. Among these, multi-target backdoor attacks can simultaneously target multiple classes. However, existing multi-target backdoor attacks all follow the dirty-label paradigm, where poisoned samples are mislabeled, and most of them require an extremely high poisoning rate. This makes them easily detectable by manual inspection. In contrast, clean-label attacks are more stealthy, as they avoid modifying the labels of poisoned samples. However, they generally struggle to achieve stable and satisfactory attack performance and often fail to scale effectively to multi-target attacks. To address this issue, we propose the Feature-based Full-target Clean-label Backdoor Attacks (FFCBA) which consists of two paradigms: Feature-Spanning Backdoor Attacks (FSBA) and Feature-Migrating Backdoor Attacks (FMBA). FSBA leverages class-conditional autoencoders to generate noise triggers that align perturbed in-class samples with the original category's features, ensuring the effectiveness, intra-class consistency, inter-class specificity and natural-feature correlation of triggers. While FSBA supports swift and efficient attacks, its cross-model attack capability is relatively weak. FMBA employs a two-stage class-conditional autoencoder training process that alternates between using out-of-class samples and in-class samples. This allows FMBA to generate triggers with strong target-class features, making it highly effective for cross-model attacks. We conduct experiments on multiple datasets and models, the results show that FFCBA achieves outstanding attack performance and maintains desirable robustness against the state-of-the-art backdoor defenses.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
SFIBA: Spatial-based Full-target Invisible Backdoor Attacks
Authors:
Yangxu Yin,
Honglong Chen,
Yudong Gao,
Peng Sun,
Zhishuai Li,
Weifeng Liu
Abstract:
Multi-target backdoor attacks pose significant security threats to deep neural networks, as they can preset multiple target classes through a single backdoor injection. This allows attackers to control the model to misclassify poisoned samples with triggers into any desired target class during inference, exhibiting superior attack performance compared with conventional backdoor attacks. However, e…
▽ More
Multi-target backdoor attacks pose significant security threats to deep neural networks, as they can preset multiple target classes through a single backdoor injection. This allows attackers to control the model to misclassify poisoned samples with triggers into any desired target class during inference, exhibiting superior attack performance compared with conventional backdoor attacks. However, existing multi-target backdoor attacks fail to guarantee trigger specificity and stealthiness in black-box settings, resulting in two main issues. First, they are unable to simultaneously target all classes when only training data can be manipulated, limiting their effectiveness in realistic attack scenarios. Second, the triggers often lack visual imperceptibility, making poisoned samples easy to detect. To address these problems, we propose a Spatial-based Full-target Invisible Backdoor Attack, called SFIBA. It restricts triggers for different classes to specific local spatial regions and morphologies in the pixel space to ensure specificity, while employing a frequency-domain-based trigger injection method to guarantee stealthiness. Specifically, for injection of each trigger, we first apply fast fourier transform to obtain the amplitude spectrum of clean samples in local spatial regions. Then, we employ discrete wavelet transform to extract the features from the amplitude spectrum and use singular value decomposition to integrate the trigger. Subsequently, we selectively filter parts of the trigger in pixel space to implement trigger morphology constraints and adjust injection coefficients based on visual effects. We conduct experiments on multiple datasets and models. The results demonstrate that SFIBA can achieve excellent attack performance and stealthiness, while preserving the model's performance on benign samples, and can also bypass existing backdoor defenses.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.