-
InvDesFlow-AL: Active Learning-based Workflow for Inverse Design of Functional Materials
Authors:
Xiao-Qi Han,
Peng-Jie Guo,
Ze-Feng Gao,
Hao Sun,
Zhong-Yi Lu
Abstract:
Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for ge…
▽ More
Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for generating and predicting crystal structures often remain limited by low success rates. In this work, we propose a novel inverse material design generative framework called InvDesFlow-AL, which is based on active learning strategies. This framework can iteratively optimize the material generation process to gradually guide it towards desired performance characteristics. In terms of crystal structure prediction, the InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing an 32.96% improvement in performance compared to exsisting generative models. Additionally, InvDesFlow-AL has been successfully validated in the design of low-formation-energy and low-Ehull materials. It can systematically generate materials with progressively lower formation energies while continuously expanding the exploration across diverse chemical spaces. These results fully demonstrate the effectiveness of the proposed active learning-driven generative model in accelerating material discovery and inverse design. To further prove the effectiveness of this method, we took the search for BCS superconductors under ambient pressure as an example explored by InvDesFlow-AL. As a result, we successfully identified Li\(_2\)AuH\(_6\) as a conventional BCS superconductor with an ultra-high transition temperature of 140 K. This discovery provides strong empirical support for the application of inverse design in materials science.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models
Authors:
Xiaoyang Chen,
Xinan Dai,
Yu Du,
Qian Feng,
Naixu Guo,
Tingshuo Gu,
Yuting Gao,
Yingyi Gao,
Xudong Han,
Xiang Jiang,
Yilin Jin,
Hongyi Lin,
Shisheng Lin,
Xiangnan Li,
Yuante Li,
Yixing Li,
Zhentao Lai,
Zilu Ma,
Yingrong Peng,
Jiacheng Qian,
Hao-Yu Sun,
Jianbo Sun,
Zirui Wang,
Siwei Wu,
Zian Wang
, et al. (6 additional authors not shown)
Abstract:
To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as e…
▽ More
To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose an evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria -- emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations -- the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation
Authors:
Hanle Zheng,
Xujie Han,
Zegang Peng,
Shangbin Zhang,
Guangxun Du,
Zhuo Zou,
Xilin Wang,
Jibin Wu,
Hao Guo,
Lei Deng
Abstract:
Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafte…
▽ More
Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72dB PSNR gain on Vimeo90K-Triplet and 4.24X faster inference.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Joint Graph Convolution and Sequential Modeling for Scalable Network Traffic Estimation
Authors:
Nan Jiang,
Wenxuan Zhu,
Xu Han,
Weiqiang Huang,
Yumeng Sun
Abstract:
This study focuses on the challenge of predicting network traffic within complex topological environments. It introduces a spatiotemporal modeling approach that integrates Graph Convolutional Networks (GCN) with Gated Recurrent Units (GRU). The GCN component captures spatial dependencies among network nodes, while the GRU component models the temporal evolution of traffic data. This combination al…
▽ More
This study focuses on the challenge of predicting network traffic within complex topological environments. It introduces a spatiotemporal modeling approach that integrates Graph Convolutional Networks (GCN) with Gated Recurrent Units (GRU). The GCN component captures spatial dependencies among network nodes, while the GRU component models the temporal evolution of traffic data. This combination allows for precise forecasting of future traffic patterns. The effectiveness of the proposed model is validated through comprehensive experiments on the real-world Abilene network traffic dataset. The model is benchmarked against several popular deep learning methods. Furthermore, a set of ablation experiments is conducted to examine the influence of various components on performance, including changes in the number of graph convolution layers, different temporal modeling strategies, and methods for constructing the adjacency matrix. Results indicate that the proposed approach achieves superior performance across multiple metrics, demonstrating robust stability and strong generalization capabilities in complex network traffic forecasting scenarios.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention
Authors:
Jianxin Zhang,
Lianzi Jiang,
Xinyu Han,
Xiangrong Wang,
Weinan Huang
Abstract:
Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of differ…
▽ More
Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of different wave gauges. To address these challenges, this study introduces the Exogenous-to-Endogenous Frequency-Aware Network (E2E-FANet), a novel end-to-end neural network designed to model relationships between waves and structures. The E2E-FANetarchitecture incorporates a Dual-Basis Frequency Mapping (DBFM) module that leverages orthogonal cosine and sine bases to extract wave features from the frequency domain while preserving temporal information. Additionally, we introduce the Exogenous-to-Endogenous Cross-Attention (E2ECA) module, which employs cross attention to model the interactions between endogenous and exogenous variables. We incorporate a Temporal-wise Attention (TA) mechanism that adaptively captures complex dependencies in endogenous variables. These integrated modules function synergistically, enabling E2E-FANet to achieve both comprehensive feature perception in the time-frequency domain and precise modeling of wave-structure interactions. To comprehensively evaluate the performance of E2E-FANet, we constructed a multi-level validation framework comprising three distinct testing scenarios: internal validation under identical wave conditions, generalization testing across different wave conditions, and adaptability testing with varying relative water density (RW) conditions. These comprehensive tests demonstrate that E2E-FANet provides accurate waves behind FB predictions while successfully generalizing diverse wave conditions.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network
Authors:
Jianxin Zhang,
Lianzi Jiang,
Xinyu Han,
Xiangrong Wang
Abstract:
Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models have demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the…
▽ More
Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models have demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the unknown data in the test set can lead to data leakage issues. To simultaneously achieve data feature extraction and prevent data leakage, a novel Adaptive Feature Extraction Time-Frequency Network (AFE-TFNet) is proposed to improve prediction accuracy and stability. It is encoder-decoder rolling framework. The encoder consists of two stages: feature extraction and feature fusion. In the feature extraction stage, global and local frequency domain features are extracted by combining Wavelet Transform (WT) and Fourier Transform (FT), and multi-scale frequency analysis is performed using Inception blocks. In the feature fusion stage, time-domain and frequency-domain features are integrated through dominant harmonic sequence energy weighting (DHSEW). The decoder employed an advanced long short-term memory (LSTM) model. Hourly measured wind speed (Ws), dominant wave period (DPD), average wave period (APD) and Hs from three stations are used as the dataset, and the four metrics are employed to evaluate the forecasting performance. Results show that AFE-TFNet significantly outperforms benchmark methods in terms of prediction accuracy. Feature extraction can significantly improve the prediction accuracy. DHSEW has substantially increased the accuracy of medium-term to long-term forecasting. The prediction accuracy of AFE-TFNet does not demonstrate significant variability with changes of rolling time window size. Overall, AFE-TFNet shows strong potential for handling complex signal forecasting.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Towards Robust Few-Shot Text Classification Using Transformer Architectures and Dual Loss Strategies
Authors:
Xu Han,
Yumeng Sun,
Weiqiang Huang,
Hongye Zheng,
Junliang Du
Abstract:
Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks…
▽ More
Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks, especially in the 5-shot setting, which can more effectively capture text features and improve classification accuracy. The experiment also found that there are significant differences in the classification difficulty of different relationship categories. Some categories have fuzzy semantic boundaries or complex feature distributions, making it difficult for the standard cross entropy loss to learn the discriminative information required to distinguish categories. By introducing contrastive loss and regularization loss, the generalization ability of the model is enhanced, effectively alleviating the overfitting problem in few-shot environments. In addition, the research results show that the use of Transformer models or generative architectures with stronger self-attention mechanisms can help improve the stability and accuracy of few-shot classification.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
The Application of Deep Learning for Lymph Node Segmentation: A Systematic Review
Authors:
Jingguo Qu,
Xinyang Han,
Man-Lik Chui,
Yao Pu,
Simon Takadiyi Gunda,
Ziman Chen,
Jing Qin,
Ann Dorothy King,
Winnie Chiu-Wing Chu,
Jing Cai,
Michael Tin-Cheung Ying
Abstract:
Automatic lymph node segmentation is the cornerstone for advances in computer vision tasks for early detection and staging of cancer. Traditional segmentation methods are constrained by manual delineation and variability in operator proficiency, limiting their ability to achieve high accuracy. The introduction of deep learning technologies offers new possibilities for improving the accuracy of lym…
▽ More
Automatic lymph node segmentation is the cornerstone for advances in computer vision tasks for early detection and staging of cancer. Traditional segmentation methods are constrained by manual delineation and variability in operator proficiency, limiting their ability to achieve high accuracy. The introduction of deep learning technologies offers new possibilities for improving the accuracy of lymph node image analysis. This study evaluates the application of deep learning in lymph node segmentation and discusses the methodologies of various deep learning architectures such as convolutional neural networks, encoder-decoder networks, and transformers in analyzing medical imaging data across different modalities. Despite the advancements, it still confronts challenges like the shape diversity of lymph nodes, the scarcity of accurately labeled datasets, and the inadequate development of methods that are robust and generalizable across different imaging modalities. To the best of our knowledge, this is the first study that provides a comprehensive overview of the application of deep learning techniques in lymph node segmentation task. Furthermore, this study also explores potential future research directions, including multimodal fusion techniques, transfer learning, and the use of large-scale pre-trained models to overcome current limitations while enhancing cancer diagnosis and treatment planning strategies.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Modeling Multi-Hop Semantic Paths for Recommendation in Heterogeneous Information Networks
Authors:
Hongye Zheng,
Yue Xing,
Lipeng Zhu,
Xu Han,
Junliang Du,
Wanyu Cui
Abstract:
This study focuses on the problem of path modeling in heterogeneous information networks and proposes a multi-hop path-aware recommendation framework. The method centers on multi-hop paths composed of various types of entities and relations. It models user preferences through three stages: path selection, semantic representation, and attention-based fusion. In the path selection stage, a path filt…
▽ More
This study focuses on the problem of path modeling in heterogeneous information networks and proposes a multi-hop path-aware recommendation framework. The method centers on multi-hop paths composed of various types of entities and relations. It models user preferences through three stages: path selection, semantic representation, and attention-based fusion. In the path selection stage, a path filtering mechanism is introduced to remove redundant and noisy information. In the representation learning stage, a sequential modeling structure is used to jointly encode entities and relations, preserving the semantic dependencies within paths. In the fusion stage, an attention mechanism assigns different weights to each path to generate a global user interest representation. Experiments conducted on real-world datasets such as Amazon-Book show that the proposed method significantly outperforms existing recommendation models across multiple evaluation metrics, including HR@10, Recall@10, and Precision@10. The results confirm the effectiveness of multi-hop paths in capturing high-order interaction semantics and demonstrate the expressive modeling capabilities of the framework in heterogeneous recommendation scenarios. This method provides both theoretical and practical value by integrating structural information modeling in heterogeneous networks with recommendation algorithm design. It offers a more expressive and flexible paradigm for learning user preferences in complex data environments.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Authors:
Yudong Wang,
Zixuan Fu,
Jie Cai,
Peijun Tang,
Hongya Lyu,
Yewei Fang,
Zhi Zheng,
Jie Zhou,
Guoyang Zeng,
Chaojun Xiao,
Xu Han,
Zhiyuan Liu
Abstract:
Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and…
▽ More
Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising
Authors:
Xiaolong Niu,
Zanting Ye,
Xu Han,
Yanchao Huang,
Hao Sun,
Hubing Wu,
Lijun Lu
Abstract:
Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response cause…
▽ More
Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response caused by inter-patient variability, and complementary anatomical constraints derived from CT images. In this work, we propose a novel CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for multi-dose PET denoising. Our approach integrates anatomical guidance and dose-level adaptation to achieve superior denoising performance under low-dose conditions. Specifically, this approach incorporates a CT-Guided High-frequency Wavelet Attention (HWA) module, which uses wavelet transforms to separate high-frequency anatomical boundary features from CT images. These extracted features are then incorporated into PET imaging through an adaptive weighted fusion mechanism to enhance edge details. Additionally, we propose the Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism that dynamically integrates dose levels into channel-spatial attention weight calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets demonstrate that MDAA-Diff outperforms state-of-the-art approaches in preserving diagnostic quality under reduced-dose conditions. Our code is publicly available.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
Authors:
Jingwen Ye,
Yuze He,
Yanning Zhou,
Yiqin Zhu,
Kaiwen Xiao,
Yong-Jin Liu,
Wei Yang,
Xiao Han
Abstract:
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or le…
▽ More
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Authors:
Drew Prinster,
Xing Han,
Anqi Liu,
Suchi Saria
Abstract:
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection -- especially the tools of conformal test martingales (CTMs) and anytime-v…
▽ More
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection -- especially the tools of conformal test martingales (CTMs) and anytime-valid inference -- offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or ``alarm criteria'' (such as data shifts that violate certain exchangeability assumptions), do not allow for online adaptation in response to shifts, and/or do not enable root-cause analysis of any degradation. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution) while quickly detecting and diagnosing more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
△ Less
Submitted 12 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
Holmes: Automated Fact Check with Large Language Models
Authors:
Haoran Ou,
Gelei Deng,
Xingshuo Han,
Jie Zhang,
Xinlei He,
Han Qiu,
Shangwei Guo,
Tianwei Zhang
Abstract:
The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by adv…
▽ More
The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by advances in AI, this study explores using Large Language Models (LLMs) for automated disinformation detection. The empirical study shows that (1) LLMs alone cannot reliably assess the truthfulness of claims; (2) providing relevant evidence significantly improves their performance; (3) however, LLMs cannot autonomously search for accurate evidence. To address this, we propose Holmes, an end-to-end framework featuring a novel evidence retrieval method that assists LLMs in collecting high-quality evidence. Our approach uses (1) LLM-powered summarization to extract key information from open sources and (2) a new algorithm and metrics to evaluate evidence quality. Holmes enables LLMs to verify claims and generate justifications effectively. Experiments show Holmes achieves 88.3% accuracy on two open-source datasets and 90.2% in real-time verification tasks. Notably, our improved evidence retrieval boosts fact-checking accuracy by 30.8% over existing methods
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization
Authors:
Chenghong Li,
Hongjie Liao,
Yihao Zhi,
Xihe Yang,
Zhengwentai Sun,
Jiahao Chang,
Shuguang Cui,
Xiaoguang Han
Abstract:
In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable larg…
▽ More
In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable large-scale human dataset. To bridge this gap, we present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using multi-view human capture systems, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. Additionally, the proposed MVHumanNet++ dataset is enhanced with newly processed normal maps and depth maps, significantly expanding its applicability and utility for advanced human-centric research. To explore the potential of our proposed MVHumanNet++ dataset in various 2D and 3D visual tasks, we conducted several pilot studies to demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet++. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet++ dataset with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. MVHumanNet++ is publicly available at https://kevinlee09.github.io/research/MVHumanNet++/.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Optimized Lattice-Structured Flexible EIT Sensor for Tactile Reconstruction and Classification
Authors:
Huazhi Dong,
Sihao Teng,
Xu Han,
Xiaopeng Wu,
Francesco Giorgio-Serchi,
Yunjie Yang
Abstract:
Flexible electrical impedance tomography (EIT) offers a promising alternative to traditional tactile sensing approaches, enabling low-cost, scalable, and deformable sensor designs. Here, we propose an optimized lattice-structured flexible EIT tactile sensor incorporating a hydrogel-based conductive layer, systematically designed through three-dimensional coupling field simulations to optimize stru…
▽ More
Flexible electrical impedance tomography (EIT) offers a promising alternative to traditional tactile sensing approaches, enabling low-cost, scalable, and deformable sensor designs. Here, we propose an optimized lattice-structured flexible EIT tactile sensor incorporating a hydrogel-based conductive layer, systematically designed through three-dimensional coupling field simulations to optimize structural parameters for enhanced sensitivity and robustness. By tuning the lattice channel width and conductive layer thickness, we achieve significant improvements in tactile reconstruction quality and classification performance. Experimental results demonstrate high-quality tactile reconstruction with correlation coefficients up to 0.9275, peak signal-to-noise ratios reaching 29.0303 dB, and structural similarity indexes up to 0.9660, while maintaining low relative errors down to 0.3798. Furthermore, the optimized sensor accurately classifies 12 distinct tactile stimuli with an accuracy reaching 99.6%. These results highlight the potential of simulation-guided structural optimization for advancing flexible EIT-based tactile sensors toward practical applications in wearable systems, robotics, and human-machine interfaces.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Authors:
Haifeng Huang,
Xinyi Chen,
Yilun Chen,
Hao Li,
Xiaoshen Han,
Zehan Wang,
Tai Wang,
Jiangmiao Pang,
Zhou Zhao
Abstract:
Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and siz…
▽ More
Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Concurrency Testing in the Linux Kernel via eBPF
Authors:
Jiacheng Xu,
Dylan Wolff,
Xing Yi Han,
Jialin Li,
Abhik Roychoudhury
Abstract:
Concurrency is vital for our critical software to meet modern performance requirements, yet concurrency bugs are notoriously difficult to detect and reproduce. Controlled Concurrency Testing (CCT) can make bugs easier to expose by enabling control over thread interleavings and systematically exploring the interleaving space through scheduling algorithms. However, existing CCT solutions for kernel…
▽ More
Concurrency is vital for our critical software to meet modern performance requirements, yet concurrency bugs are notoriously difficult to detect and reproduce. Controlled Concurrency Testing (CCT) can make bugs easier to expose by enabling control over thread interleavings and systematically exploring the interleaving space through scheduling algorithms. However, existing CCT solutions for kernel code are heavyweight, leading to significant performance, maintainability and extensibility issues. In this work, we introduce LACE, a lightweight CCT framework for kernel code empowered by eBPF. Without hypervisor modification, LACE features a custom scheduler tailored for CCT algorithms to serialize non-determistic thread execution into a controlled ordering. LACE also provides a mechanism to safely inject scheduling points into the kernel for fine-grained control. Furthermore, LACE employs a two-phase mutation strategy to integrate the scheduler with a concurrency fuzzer, allowing for automated exploration of both the input and schedule space. In our evaluation, LACE achieves 38\% more branches, 57\% overhead reduction and 11.4$\times$ speed-up in bug exposure compared to the state-of-the-art kernel concurrency fuzzers. Our qualitative analysis also demonstrates the extensibility and maintainability of LACE. Furthermore, LACE discovers eight previously unknown bugs in the Linux kernel, with six confirmed by developers.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Fairness Is More Than Algorithms: Racial Disparities in Time-to-Recidivism
Authors:
Jessy Xinyi Han,
Kristjan Greenewald,
Devavrat Shah
Abstract:
Racial disparities in recidivism remain a persistent challenge within the criminal justice system, increasingly exacerbated by the adoption of algorithmic risk assessment tools. Past works have primarily focused on bias induced by these tools, treating recidivism as a binary outcome. Limited attention has been given to non-algorithmic factors (including socioeconomic ones) in driving racial dispar…
▽ More
Racial disparities in recidivism remain a persistent challenge within the criminal justice system, increasingly exacerbated by the adoption of algorithmic risk assessment tools. Past works have primarily focused on bias induced by these tools, treating recidivism as a binary outcome. Limited attention has been given to non-algorithmic factors (including socioeconomic ones) in driving racial disparities from a systemic perspective. To that end, this work presents a multi-stage causal framework to investigate the advent and extent of disparities by considering time-to-recidivism rather than a simple binary outcome. The framework captures interactions among races, the algorithm, and contextual factors. This work introduces the notion of counterfactual racial disparity and offers a formal test using survival analysis that can be conducted with observational data to assess if differences in recidivism arise from algorithmic bias, contextual factors, or their interplay. In particular, it is formally established that if sufficient statistical evidence for differences across racial groups is observed, it would support rejecting the null hypothesis that non-algorithmic factors (including socioeconomic ones) do not affect recidivism. An empirical study applying this framework to the COMPAS dataset reveals that short-term recidivism patterns do not exhibit racial disparities when controlling for risk scores. However, statistically significant disparities emerge with longer follow-up periods, particularly for low-risk groups. This suggests that factors beyond algorithmic scores, possibly structural disparities in housing, employment, and social support, may accumulate and exacerbate recidivism risks over time. This underscores the need for policy interventions extending beyond algorithmic improvements to address broader influences on recidivism trajectories.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
BBAL: A Bidirectional Block Floating Point-Based Quantisation Accelerator for Large Language Models
Authors:
Xiaomeng Han,
Yuan Cheng,
Jing Wang,
Junyang Lu,
Hui Wang,
X. x. Zhang,
Ning Xu,
Dawei Yang,
Zhe Jiang
Abstract:
Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block Floating Point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating point operations into low-bit fixed point operations. However, BFP requires aligning all data to…
▽ More
Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block Floating Point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating point operations into low-bit fixed point operations. However, BFP requires aligning all data to the maximum exponent, which causes loss of small and moderate values, resulting in quantisation error and degradation in the accuracy of LLMs. To address this issue, we propose a Bidirectional Block Floating Point (BBFP) data format, which reduces the probability of selecting the maximum as shared exponent, thereby reducing quantisation error. By utilizing the features in BBFP, we present a full-stack Bidirectional Block Floating Point-Based Quantisation Accelerator for LLMs (BBAL), primarily comprising a processing element array based on BBFP, paired with proposed cost-effective nonlinear computation unit. Experimental results show BBAL achieves a 22% improvement in accuracy compared to an outlier-aware accelerator at similar efficiency, and a 40% efficiency improvement over a BFP-based accelerator at similar accuracy.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Shape-Guided Clothing Warping for Virtual Try-On
Authors:
Xiaoyu Han,
Shunyuan Zheng,
Zonglin Li,
Chenyang Wang,
Xin Sun,
Quanling Meng
Abstract:
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistenc…
▽ More
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/SCW-VTON.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts
Authors:
Zhenkui Yang,
Zeyi Huang,
Ge Wang,
Han Ding,
Tony Xiao Han,
Fei Wang
Abstract:
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contac…
▽ More
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies-label-only, brief description, and detailed action description-without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our codes have been included in https://github.com/yangzhenkui/WiTalk.
△ Less
Submitted 22 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Authors:
Yuhang Liu,
Pengxiang Li,
Congkai Xie,
Xavier Hu,
Xiaotian Han,
Shengyu Zhang,
Hongxia Yang,
Fei Wu
Abstract:
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex…
▽ More
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
△ Less
Submitted 19 April, 2025;
originally announced April 2025.
-
Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Authors:
Junjie Yang,
Junhao Song,
Xudong Han,
Ziqian Bi,
Tianyang Wang,
Chia Xin Liang,
Xinyuan Song,
Yichao Zhang,
Qian Niu,
Benji Peng,
Keyu Chen,
Ming Liu
Abstract:
Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, suc…
▽ More
Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Testing the Fault-Tolerance of Multi-Sensor Fusion Perception in Autonomous Driving Systems
Authors:
Haoxiang Tian,
Wenqiang Ding,
Xingshuo Han,
Guoquan Wu,
An Guo,
Junqi Zhang. Wei Chen,
Jun Wei,
Tianwei Zhang
Abstract:
High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world au…
▽ More
High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, cameras and LiDAR are subject to various faults, which can probably significantly impact the decision-making and behaviors of ADSs. Existing MSF testing approaches only discovered corner cases that the MSF-based perception cannot accurately detected by MSF-based perception, while lacking research on how sensor faults affect the system-level behaviors of ADSs.
To address this gap, we conduct the first exploration of the fault tolerance of MSF perception-based ADS for sensor faults. In this paper, we systematically and comprehensively build fault models for cameras and LiDAR in AVs and inject them into the MSF perception-based ADS to test its behaviors in test scenarios. To effectively and efficiently explore the parameter spaces of sensor fault models, we design a feedback-guided differential fuzzer to discover the safety violations of MSF perception-based ADS caused by the injected sensor faults. We evaluate FADE on the representative and practical industrial ADS, Baidu Apollo. Our evaluation results demonstrate the effectiveness and efficiency of FADE, and we conclude some useful findings from the experimental results. To validate the findings in the physical world, we use a real Baidu Apollo 6.0 EDU autonomous vehicle to conduct the physical experiments, and the results show the practical significance of our findings.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation
Authors:
Evans Xu Han,
Alice Qian Zhang,
Hong Shen,
Haiyi Zhu,
Paul Pu Liang,
Jane Hsieh
Abstract:
State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for…
▽ More
State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET's ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
CAGE-GS: High-fidelity Cage Based 3D Gaussian Splatting Deformation
Authors:
Yifei Tong,
Runze Tian,
Xiao Han,
Dingyao Liu,
Fenggen Yu,
Yan Zhang
Abstract:
As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns…
▽ More
As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns a deformation cage from the target, which guides the geometric transformation of the source scene. While the cages effectively control structural alignment, preserving the textural appearance of 3DGS remains challenging due to the complexity of covariance parameters. To address this, we employ a Jacobian matrix-based strategy to update the covariance parameters of each Gaussian, ensuring texture fidelity post-deformation. Our method is highly flexible, accommodating various target shape representations, including texts, images, point clouds, meshes and 3DGS models. Extensive experiments and ablation studies on both public datasets and newly proposed scenes demonstrate that our method significantly outperforms existing techniques in both efficiency and deformation quality.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
ARAP-GS: Drag-driven As-Rigid-As-Possible 3D Gaussian Splatting Editing with Diffusion Prior
Authors:
Xiao Han,
Runze Tian,
Yifei Tong,
Fenggen Yu,
Dingyao Liu,
Yan Zhang
Abstract:
Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driv…
▽ More
Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driven editing for the widely-used 3D Gaussian Splatting (3DGS) representation, as deforming 3DGS while preserving shape coherence and visual continuity remains challenging. In this paper, we introduce ARAP-GS, a drag-driven 3DGS editing framework based on As-Rigid-As-Possible (ARAP) deformation. Unlike previous 3DGS editing methods, we are the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, we incorporate an advanced diffusion prior for image super-resolution within our iterative optimization process. This approach enhances visual quality while maintaining multi-view consistency in the edited results. Experiments show that ARAP-GS outperforms current methods across diverse 3D scenes, demonstrating its effectiveness and superiority for drag-driven 3DGS editing. Additionally, our method is highly efficient, requiring only 10 to 20 minutes to edit a scene on a single RTX 3090 GPU.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Yufei Wang,
Wenhan Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Qiyu Rong,
Hongyuan Jing,
Mengmeng Zhang,
Jinglong Li,
Xiangyu Lu,
Yi Ren,
Yuting Liu,
Meng Zhang,
Xiang Chen,
Qiyuan Guan,
Jiangxin Dong,
Jinshan Pan,
Conglin Gou
, et al. (112 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…
▽ More
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
△ Less
Submitted 19 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
Authors:
Wang Yang,
Xiang Yue,
Vipin Chaudhary,
Xiaotian Han
Abstract:
Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates…
▽ More
Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, marking a substantial improvement of 6.2%. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, achieving a relative improvement of 7.8%.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Exploring Video-Based Driver Activity Recognition under Noisy Labels
Authors:
Linjuan Fan,
Di Wen,
Kunyu Peng,
Kailun Yang,
Jiaming Zhang,
Ruiping Liu,
Yufan Chen,
Junwei Zheng,
Jiamin Wu,
Xudong Han,
Rainer Stiefelhagen
Abstract:
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the d…
▽ More
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Finding Locally Densest Subgraphs: Convex Programming with Edge and Triangle Density
Authors:
Yi Yang,
Chenhao Ma,
Reynold Cheng,
Laks V. S. Lakshmanan,
Xiaolin Han
Abstract:
Finding the densest subgraph (DS) from a graph is a fundamental problem in graph databases. The DS obtained, which reveals closely related entities, has been found to be useful in various application domains such as e-commerce, social science, and biology. However, in a big graph that contains billions of edges, it is desirable to find more than one subgraph cluster that is not necessarily the den…
▽ More
Finding the densest subgraph (DS) from a graph is a fundamental problem in graph databases. The DS obtained, which reveals closely related entities, has been found to be useful in various application domains such as e-commerce, social science, and biology. However, in a big graph that contains billions of edges, it is desirable to find more than one subgraph cluster that is not necessarily the densest, yet they reveal closely related vertices. In this paper, we study the locally densest subgraph (LDS), a recently proposed variant of DS. An LDS is a subgraph which is the densest among the ``local neighbors''. Given a graph $G$, a number of LDSs can be returned, which reflect different dense regions of $G$ and thus give more information than DS. The existing LDS solution suffers from low efficiency. We thus develop a convex-programming-based solution that enables powerful pruning. We also extend our algorithm to triangle-based density to solve LTDS problem. Based on current algorithms, we propose a unified framework for the LDS and LTDS problems. Extensive experiments on thirteen real large graph datasets show that our proposed algorithm is up to four orders of magnitude faster than the state-of-the-art.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Authors:
Changjiang Gao,
Hankun Lin,
Shujian Huang,
Xin Huang,
Xue Han,
Junlan Feng,
Chao Deng,
Jiajun Chen
Abstract:
The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lin…
▽ More
The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code and is available at https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Authors:
Yize Zhang,
Tianshu Wang,
Sirui Chen,
Kun Wang,
Xingyu Zeng,
Hongyu Lin,
Xianpei Han,
Le Sun,
Chaochao Lu
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test--time compute. However, their application in open--ended, knowledge--intensive, complex reasoning scenarios is still limited. Reasoning--oriented methods struggle to generalize to open--ended scenarios due to implicit assumptions of complete…
▽ More
Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test--time compute. However, their application in open--ended, knowledge--intensive, complex reasoning scenarios is still limited. Reasoning--oriented methods struggle to generalize to open--ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge--augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore--exploit tradeoff arises in multi--branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval--augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state--of--the--art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Authors:
Weixiang Zhao,
Jiahe Guo,
Yulin Hu,
Yang Deng,
An Zhang,
Xingyu Sui,
Xinyang Han,
Yanyan Zhao,
Bing Qin,
Tat-Seng Chua,
Ting Liu
Abstract:
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjus…
▽ More
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
Context-Aware Adaptive Sampling for Intelligent Data Acquisition Systems Using DQN
Authors:
Weiqiang Huang,
Juecen Zhan,
Yumeng Sun,
Xu Han,
Tai An,
Nan Jiang
Abstract:
Multi-sensor systems are widely used in the Internet of Things, environmental monitoring, and intelligent manufacturing. However, traditional fixed-frequency sampling strategies often lead to severe data redundancy, high energy consumption, and limited adaptability, failing to meet the dynamic sensing needs of complex environments. To address these issues, this paper proposes a DQN-based multi-sen…
▽ More
Multi-sensor systems are widely used in the Internet of Things, environmental monitoring, and intelligent manufacturing. However, traditional fixed-frequency sampling strategies often lead to severe data redundancy, high energy consumption, and limited adaptability, failing to meet the dynamic sensing needs of complex environments. To address these issues, this paper proposes a DQN-based multi-sensor adaptive sampling optimization method. By leveraging a reinforcement learning framework to learn the optimal sampling strategy, the method balances data quality, energy consumption, and redundancy. We first model the multi-sensor sampling task as a Markov Decision Process (MDP), then employ a Deep Q-Network to optimize the sampling policy. Experiments on the Intel Lab Data dataset confirm that, compared with fixed-frequency sampling, threshold-triggered sampling, and other reinforcement learning approaches, DQN significantly improves data quality while lowering average energy consumption and redundancy rates. Moreover, in heterogeneous multi-sensor environments, DQN-based adaptive sampling shows enhanced robustness, maintaining superior data collection performance even in the presence of interference factors. These findings demonstrate that DQN-based adaptive sampling can enhance overall data acquisition efficiency in multi-sensor systems, providing a new solution for efficient and intelligent sensing.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Adaptive Planning Framework for UAV-Based Surface Inspection in Partially Unknown Indoor Environments
Authors:
Hanyu Jin,
Zhefan Xu,
Haoyu Shen,
Xinming Han,
Kanlong Ye,
Kenji Shimada
Abstract:
Inspecting indoor environments such as tunnels, industrial facilities, and construction sites is essential for infrastructure monitoring and maintenance. While manual inspection in these environments is often time-consuming and potentially hazardous, Unmanned Aerial Vehicles (UAVs) can improve efficiency by autonomously handling inspection tasks. Such inspection tasks usually rely on reference map…
▽ More
Inspecting indoor environments such as tunnels, industrial facilities, and construction sites is essential for infrastructure monitoring and maintenance. While manual inspection in these environments is often time-consuming and potentially hazardous, Unmanned Aerial Vehicles (UAVs) can improve efficiency by autonomously handling inspection tasks. Such inspection tasks usually rely on reference maps for coverage planning. However, in industrial applications, only the floor plans are typically available. The unforeseen obstacles not included in the floor plans will result in outdated reference maps and inefficient or unsafe inspection trajectories. In this work, we propose an adaptive inspection framework that integrates global coverage planning with local reactive adaptation to improve the coverage and efficiency of UAV-based inspection in partially unknown indoor environments. Experimental results in structured indoor scenarios demonstrate the effectiveness of the proposed approach in inspection efficiency and achieving high coverage rates with adaptive obstacle handling, highlighting its potential for enhancing the efficiency of indoor facility inspection.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Bayesian Reasoning Enabled by Spin-Orbit Torque Magnetic Tunnel Junctions
Authors:
Yingqian Xu,
Xiaohan Li,
Caihua Wan,
Ran Zhang,
Bin He,
Shiqiang Liu,
Jihao Xia,
Dehao Kong,
Shilong Xiong,
Guoqiang Yu,
Xiufeng Han
Abstract:
Bayesian networks play an increasingly important role in data mining, inference, and reasoning with the rapid development of artificial intelligence. In this paper, we present proof-of-concept experiments demonstrating the use of spin-orbit torque magnetic tunnel junctions (SOT-MTJs) in Bayesian network reasoning. Not only can the target probability distribution function (PDF) of a Bayesian networ…
▽ More
Bayesian networks play an increasingly important role in data mining, inference, and reasoning with the rapid development of artificial intelligence. In this paper, we present proof-of-concept experiments demonstrating the use of spin-orbit torque magnetic tunnel junctions (SOT-MTJs) in Bayesian network reasoning. Not only can the target probability distribution function (PDF) of a Bayesian network be precisely formulated by a conditional probability table as usual but also quantitatively parameterized by a probabilistic forward propagating neuron network. Moreover, the parameters of the network can also approach the optimum through a simple point-by point training algorithm, by leveraging which we do not need to memorize all historical data nor statistically summarize conditional probabilities behind them, significantly improving storage efficiency and economizing data pretreatment. Furthermore, we developed a simple medical diagnostic system using the SOT-MTJ as a random number generator and sampler, showcasing the application of SOT-MTJ-based Bayesian reasoning. This SOT-MTJ-based Bayesian reasoning shows great promise in the field of artificial probabilistic neural network, broadening the scope of spintronic device applications and providing an efficient and low-storage solution for complex reasoning tasks.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning
Authors:
Hsing-Huan Chung,
Shravan Chaudhari,
Xing Han,
Yoav Wald,
Suchi Saria,
Joydeep Ghosh
Abstract:
Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal relationships between edge events. In this paper, we study a simpler alternative: the linear time encoder, which avoids temporal information loss caused…
▽ More
Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal relationships between edge events. In this paper, we study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. These results can be readily used to positively impact the design choices of a wide variety of dynamic graph learning architectures. The experimental code is available at: https://github.com/hsinghuan/dg-linear-time.git.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
Authors:
Yi Huang,
Ke Zhang,
Wei Liu,
Yuanyuan Wang,
Vishal M. Patel,
Le Lu,
Xu Han,
Dakai Jin,
Ke Yan
Abstract:
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular…
▽ More
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
P2Object: Single Point Supervised Object Detection and Instance Segmentation
Authors:
Pengfei Chen,
Xuehui Yu,
Xumeng Han,
Kuiran Wang,
Guorong Li,
Lingxi Xie,
Zhenjun Han,
Jianbin Jiao
Abstract:
Object recognition using single-point supervision has attracted increasing attention recently. However, the performance gap compared with fully-supervised algorithms remains large. Previous works generated class-agnostic \textbf{\textit{proposals in an image}} offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we int…
▽ More
Object recognition using single-point supervision has attracted increasing attention recently. However, the performance gap compared with fully-supervised algorithms remains large. Previous works generated class-agnostic \textbf{\textit{proposals in an image}} offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we introduce Point-to-Box Network (P2BNet), which constructs balanced \textbf{\textit{instance-level proposal bags}} by generating proposals in an anchor-like way and refining the proposals in a coarse-to-fine paradigm. Through further research, we find that the bag of proposals, either at the image level or the instance level, is established on discrete box sampling. This leads the pseudo box estimation into a sub-optimal solution, resulting in the truncation of object boundaries or the excessive inclusion of background. Hence, we conduct a series exploration of discrete-to-continuous optimization, yielding P2BNet++ and Point-to-Mask Network (P2MNet). P2BNet++ conducts an approximately continuous proposal sampling strategy by better utilizing spatial clues. P2MNet further introduces low-level image information to assist in pixel prediction, and a boundary self-prediction is designed to relieve the limitation of the estimated boxes. Benefiting from the continuous object-aware \textbf{\textit{pixel-level perception}}, P2MNet can generate more precise bounding boxes and generalize to segmentation tasks. Our method largely surpasses the previous methods in terms of the mean average precision on COCO, VOC, SBD, and Cityscapes, demonstrating great potential to bridge the performance gap compared with fully supervised tasks.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Authors:
Monojit Choudhury,
Shivam Chauhan,
Rocktim Jyoti Das,
Dhruv Sahnan,
Xudong Han,
Haonan Li,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Utkarsh Agarwal,
Mukund Choudhary,
Debopriyo Banerjee,
Fajri Koto,
Junaid Bhat,
Awantika Shukla,
Samujjwal Ghosh,
Samta Kamboj,
Onkar Pandit,
Lalit Pradhan,
Rahul Pal,
Sunil Sahu,
Soundar Doraiswamy,
Parvez Mullah,
Ali El Filali,
Neha Sengupta,
Gokul Ramakrishnan
, et al. (5 additional authors not shown)
Abstract:
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorp…
▽ More
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
Authors:
Xiao Zhang,
Xiangyu Han,
Xiwen Lai,
Yao Sun,
Pei Zhang,
Konrad Kording
Abstract:
Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both ma…
▽ More
Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both mask generation speed and segmentation accuracy. We present a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut without relying on recursive eigenvector computations, achieving substantially improved speed and accuracy. Falcon operates in two stages: (1) a fast K-way Normalized Cut solved by extending into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; and (2) refinement of the resulting masks using complementary low-level information, producing high-quality pixel-level segmentations. Experiments show that Falcon not only surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (reaching up to 4.3\% improvement on Cityscapes), but also reduces runtime by around 30% compared to prior graph-based approaches. These findings demonstrate that the semantic information within foundation-model attention can be effectively harnessed by a highly parallelizable graph cut framework. Consequently, Falcon can narrow the gap between unsupervised and supervised segmentation, enhancing scalability in real-world applications and paving the way for dense prediction-based vision pre-training in various downstream tasks. The code is released in https://github.com/KordingLab/Falcon.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Edge Approximation Text Detector
Authors:
Chuang Yang,
Xu Han,
Tao Han,
Han Han,
Bingxuan Zhao,
Qi Wang
Abstract:
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via box-to-polygon strategy or decomposing a contour into pieces for fitting gradually, the deficiency of coarse contours or complex pipelines…
▽ More
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via box-to-polygon strategy or decomposing a contour into pieces for fitting gradually, the deficiency of coarse contours or complex pipelines always exists in these models. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, it is observed that the two long edges of texts can be regarded as smooth curves. It allows us to build contours via continuous and smooth edges that cover text regions tightly instead of fitting piecewise, which helps avoid the two limitations in current models. Inspired by this observation, EdgeText formulates the text representation as the edge approximation problem via parameterized curve fitting functions. In the inference stage, our model starts with locating text centers, and then creating curve functions for approximating text edges relying on the points. Meanwhile, truncation points are determined based on the location features. In the end, extracting curve segments from curve functions by using the pixel coordinate information brought by truncation points to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed. It encourages our model to pay attention to the recognition of edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) to force the proposed model to focus on the curve distribution and avoid being disturbed by text scales.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Authors:
Xiaofeng Han,
Shunpeng Chen,
Zenghuang Fu,
Zhe Feng,
Lue Fan,
Dong An,
Changwei Wang,
Li Guo,
Weiliang Meng,
Xiaopeng Zhang,
Rongtao Xu,
Shibiao Xu
Abstract:
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on lar…
▽ More
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Improving Harmful Text Detection with Joint Retrieval and External Knowledge
Authors:
Zidong Yu,
Shuo Wang,
Nan Jiang,
Weiqiang Huang,
Xu Han,
Junliang Du
Abstract:
Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstra…
▽ More
Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Authors:
Zhijun Wang,
Jiahuan Li,
Hao Zhou,
Rongxiang Weng,
Jingang Wang,
Xin Huang,
Xue Han,
Junlan Feng,
Chao Deng,
Shujian Huang
Abstract:
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an…
▽ More
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
△ Less
Submitted 22 April, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Authors:
Qianhao Yuan,
Qingyu Zhang,
Yanjiang Liu,
Jiawei Chen,
Yaojie Lu,
Hongyu Lin,
Jia Zheng,
Xianpei Han,
Le Sun
Abstract:
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring t…
▽ More
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60\% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50\% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning
Authors:
Ruoxi Xu,
Yunjie Ji,
Boxi Cao,
Yaojie Lu,
Hongyu Lin,
Xianpei Han,
Ben He,
Yingfei Sun,
Xiangang Li,
Le Sun
Abstract:
Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper pro…
▽ More
Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
Authors:
Xingcheng Zhou,
Xuyuan Han,
Feng Yang,
Yunpu Ma,
Alois C. Knoll
Abstract:
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddi…
▽ More
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddings, we propose a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Besides, OpenDriveVLA models the dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process, ensuring both spatially and behaviorally informed trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate OpenDriveVLA's superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving. We will release our code to facilitate further research in this domain.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.