-
A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability
Authors:
Jie Zhu,
Jirong Zha,
Ding Li,
Leye Wang
Abstract:
Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice…
▽ More
Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice. In this setting, considering that self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Private Transformer Inference in MLaaS: A Survey
Authors:
Yang Li,
Xinyu Zhou,
Yitong Wang,
Liangxin Qian,
Jun Zhao
Abstract:
Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-pa…
▽ More
Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-party computation and homomorphic encryption, enabling inference while preserving both user data and model privacy. This paper reviews recent PTI advancements, highlighting state-of-the-art solutions and challenges. We also introduce a structured taxonomy and evaluation framework for PTI, focusing on balancing resource efficiency with privacy and bridging the gap between high-performance inference and data privacy.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Bridging Theory and Experiment in Materials Discovery: Machine-Learning-Assisted Prediction of Synthesizable Structures
Authors:
Yu Xin,
Peng Liu,
Zhuohang Xie,
Wenhui Mi,
Pengyue Gao,
Hong Jian Zhao,
Jian Lv,
Yanchao Wang,
Yanming Ma
Abstract:
Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven…
▽ More
Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven CSP framework that integrates symmetry-guided structure derivation with a Wyckoff encode-based machine-learning model, allowing for the efficient localization of subspaces likely to yield highly synthesizable structures. Within the identified promising subspaces, a structure-based synthesizability evaluation model, fine-tuned using recently synthesized structures to enhance predictive accuracy, is employed in conjunction with ab initio calculations to systematically identify synthesizable candidates. The framework successfully reproduces 13 experimentally known XSe (X = Sc, Ti, Mn, Fe, Ni, Cu, Zn) structures, demonstrating its effectiveness in predicting synthesizable structures. Notably, 92,310 structures are filtered from the 554,054 candidates predicted by GNoME, exhibiting great potential for promising synthesizability. Additionally, eight thermodynamically favorable Hf-X-O (X = Ti, V, and Mn) structures have been identified, among which three HfV$_2$O$_7$ candidates exhibit high synthesizability, presenting viable candidates for experimental realization and potentially associated with experimentally observed temperature-induced phase transitions. This work establishes a data-driven paradigm for machine-learning-assisted inorganic materials synthesis, highlighting its potential to bridge the gap between computational predictions and experimental realization while unlocking new opportunities for the targeted discovery of novel functional materials.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Variational Prefix Tuning for Diverse and Accurate Code Summarization Using Pre-trained Language Models
Authors:
Junda Zhao,
Yuliang Song,
Eldan Cohen
Abstract:
Recent advancements in source code summarization have leveraged transformer-based pre-trained models, including Large Language Models of Code (LLMCs), to automate and improve the generation of code summaries. However, existing methods often focus on generating a single high-quality summary for a given source code, neglecting scenarios where the generated summary might be inadequate and alternative…
▽ More
Recent advancements in source code summarization have leveraged transformer-based pre-trained models, including Large Language Models of Code (LLMCs), to automate and improve the generation of code summaries. However, existing methods often focus on generating a single high-quality summary for a given source code, neglecting scenarios where the generated summary might be inadequate and alternative options are needed. In this paper, we introduce Variational Prefix Tuning (VPT), a novel approach that enhances pre-trained models' ability to generate diverse yet accurate sets of summaries, allowing the user to choose the most suitable one for the given source code. Our method integrates a Conditional Variational Autoencoder (CVAE) framework as a modular component into pre-trained models, enabling us to model the distribution of observed target summaries and sample continuous embeddings to be used as prefixes to steer the generation of diverse outputs during decoding. Importantly, we construct our method in a parameter-efficient manner, eliminating the need for expensive model retraining, especially when using LLMCs. Furthermore, we employ a bi-criteria reranking method to select a subset of generated summaries, optimizing both the diversity and the accuracy of the options presented to users. We present extensive experimental evaluations using widely used datasets and current state-of-the-art pre-trained code summarization models to demonstrate the effectiveness of our approach and its adaptability across models.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Generative AI for Urban Planning: Synthesizing Satellite Imagery via Diffusion Models
Authors:
Qingyi Wang,
Yuebing Liang,
Yunhan Zheng,
Kaiyuan Xu,
Jinhua Zhao,
Shenhao Wang
Abstract:
Generative AI offers new opportunities for automating urban planning by creating site-specific urban layouts and enabling flexible design exploration. However, existing approaches often struggle to produce realistic and practical designs at scale. Therefore, we adapt a state-of-the-art Stable Diffusion model, extended with ControlNet, to generate high-fidelity satellite imagery conditioned on land…
▽ More
Generative AI offers new opportunities for automating urban planning by creating site-specific urban layouts and enabling flexible design exploration. However, existing approaches often struggle to produce realistic and practical designs at scale. Therefore, we adapt a state-of-the-art Stable Diffusion model, extended with ControlNet, to generate high-fidelity satellite imagery conditioned on land use descriptions, infrastructure, and natural environments. To overcome data availability limitations, we spatially link satellite imagery with structured land use and constraint information from OpenStreetMap. Using data from three major U.S. cities, we demonstrate that the proposed diffusion model generates realistic and diverse urban landscapes by varying land-use configurations, road networks, and water bodies, facilitating cross-city learning and design diversity. We also systematically evaluate the impacts of varying language prompts and control imagery on the quality of satellite imagery generation. Our model achieves high FID and KID scores and demonstrates robustness across diverse urban contexts. Qualitative assessments from urban planners and the general public show that generated images align closely with design descriptions and constraints, and are often preferred over real images. This work establishes a benchmark for controlled urban imagery generation and highlights the potential of generative AI as a tool for enhancing planning workflows and public engagement.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models
Authors:
Suhan Guo,
Jiahong Deng,
Mengjun Yi,
Furao Shen,
Jian Zhao
Abstract:
Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes…
▽ More
Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, $\textbf{S}$ensitivity $\textbf{E}$nhanced $\textbf{N}$ormalized $\textbf{D}$ispersion (SEND) that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. We have made our code publicly available https://anonymous.4open.science/r/SPAT-6042.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
ExEBench: Benchmarking Foundation Models on Extreme Earth Events
Authors:
Shan Zhao,
Zhitong Xiong,
Jie Zhao,
Xiao Xiang Zhu
Abstract:
Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over e…
▽ More
Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FM in the context of extreme events, we introduce \textbf{ExE}Bench (\textbf{Ex}treme \textbf{E}arth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme events detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of Earth system, especially under the climate change expected in the decades to come. The dataset and code are public https://github.com/zhaoshan2/EarthExtreme-Bench.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models
Authors:
Yanggang Xu,
Weijie Hong,
Jirong Zha,
Geng Chen,
Jianfeng Zheng,
Chen-Chun Hsia,
Xinlei Chen
Abstract:
In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for c…
▽ More
In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision-making. To address these issues, we propose MRLMN, a novel framework that integrates multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision-making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high-level decision-making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm-based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance, including enhanced coverage and communication quality.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification
Authors:
Hang Gao,
Wenxuan Huang,
Fengge Wu,
Junsuo Zhao,
Changwen Zheng,
Huaping Liu
Abstract:
The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this issue, we propose conducting a more in-depth analysis of this issue based on the int…
▽ More
The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this issue, we propose conducting a more in-depth analysis of this issue based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
Authors:
Ziyang Huang,
Xiaowei Yuan,
Yiming Ju,
Jun Zhao,
Kang Liu
Abstract:
Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing ones often underutilize their internal knowledge. This can lead to redundant retrievals, potential harmful knowledge conflicts, and increased inference latency. To…
▽ More
Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing ones often underutilize their internal knowledge. This can lead to redundant retrievals, potential harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is in urgent need. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which could indentify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
A Survey on Collaborative Mechanisms Between Large and Small Language Models
Authors:
Yi Chen,
JiaHao Zhao,
HaoHao Han
Abstract:
Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency, whereas Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance. Collaboration between LLMs and SLMs emerges as a crucial paradigm to synergistically balance these trade-offs, enabling advanced AI applications, especially on…
▽ More
Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency, whereas Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance. Collaboration between LLMs and SLMs emerges as a crucial paradigm to synergistically balance these trade-offs, enabling advanced AI applications, especially on resource-constrained edge devices. This survey provides a comprehensive overview of LLM-SLM collaboration, detailing various interaction mechanisms (pipeline, routing, auxiliary, distillation, fusion), key enabling technologies, and diverse application scenarios driven by on-device needs like low latency, privacy, personalization, and offline operation. While highlighting the significant potential for creating more efficient, adaptable, and accessible AI, we also discuss persistent challenges including system overhead, inter-model consistency, robust task allocation, evaluation complexity, and security/privacy concerns. Future directions point towards more intelligent adaptive frameworks, deeper model fusion, and expansion into multimodal and embodied AI, positioning LLM-SLM collaboration as a key driver for the next generation of practical and ubiquitous artificial intelligence.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
A Black-box Testing Framework for Oracle Quantum Programs
Authors:
Peixun Long,
Jianjun Zhao
Abstract:
Oracle quantum programs are a fundamental class of quantum programs that serve as a critical bridge between quantum computing and classical computing. Many important quantum algorithms are built upon oracle quantum programs, making it essential to ensure their correctness during development. While software testing is a well-established approach for improving program reliability, no systematic meth…
▽ More
Oracle quantum programs are a fundamental class of quantum programs that serve as a critical bridge between quantum computing and classical computing. Many important quantum algorithms are built upon oracle quantum programs, making it essential to ensure their correctness during development. While software testing is a well-established approach for improving program reliability, no systematic method has been developed to test oracle quantum programs. This paper proposes a black-box testing framework designed for general oracle quantum programs. We define these programs formally, establish the foundational theory for their testing, and propose a detailed testing framework. We develop a prototype tool and conduct extensive experimental evaluations to evaluate the framework's effectiveness. Our results demonstrate that the proposed framework significantly aids developers in testing oracle quantum programs, providing insights to enhance the reliability of quantum software.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Ranking-aware Continual Learning for LiDAR Place Recognition
Authors:
Xufei Wang,
Gengxuan Tian,
Junqiao Zhao,
Siyue Tao,
Qiwen Gu,
Qiankun Yu,
Tiantian Feng
Abstract:
Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new en…
▽ More
Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new environment. In this paper, we introduce a continual learning framework for LPR via Knowledge Distillation and Fusion (KDF) to alleviate forgetting. Inspired by the ranking process of place recognition retrieval, we present a ranking-aware knowledge distillation loss that encourages the network to preserve the high-level place recognition knowledge. We also introduce a knowledge fusion module to integrate the knowledge of old and new models for LiDAR place recognition. Our extensive experiments demonstrate that KDF can be applied to different networks to overcome catastrophic forgetting, surpassing the state-of-the-art methods in terms of mean Recall@1 and forgetting score.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Visual Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models: A Case Study of Influence Maximization
Authors:
Jie Zhao,
Kang Hao Cheong
Abstract:
Graph-structured combinatorial problems in complex networks are prevalent in many domains, and are computationally demanding due to their complexity and non-linear nature. Traditional evolutionary algorithms (EAs), while robust, often face obstacles due to content-shallow encoding limitations and lack of structural awareness, necessitating hand-crafted modifications for effective application. In t…
▽ More
Graph-structured combinatorial problems in complex networks are prevalent in many domains, and are computationally demanding due to their complexity and non-linear nature. Traditional evolutionary algorithms (EAs), while robust, often face obstacles due to content-shallow encoding limitations and lack of structural awareness, necessitating hand-crafted modifications for effective application. In this work, we introduce an original framework, Visual Evolutionary Optimization (VEO), leveraging multimodal large language models (MLLMs) as the backbone evolutionary optimizer in this context. Specifically, we propose a context-aware encoding way, representing the solution of the network as an image. In this manner, we can utilize MLLMs' image processing capabilities to intuitively comprehend network configurations, thus enabling machines to solve these problems in a human-like way. We have developed MLLM-based operators tailored for various evolutionary optimization stages, including initialization, crossover, and mutation. Furthermore, we propose that graph sparsification can effectively enhance the applicability and scalability of VEO on large-scale networks, owing to the scale-free nature of real-world networks. We demonstrate the effectiveness of our method using a well-known task in complex networks, influence maximization, and validate it on eight different real-world networks of various structures. The results have confirmed VEO's reliability and enhanced effectiveness compared to traditional evolutionary optimization.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
An Improved Algorithm for a Bipartite Traveling Tournament in Interleague Sports Scheduling
Authors:
Jingyang Zhao,
Mingyu Xiao
Abstract:
The bipartite traveling tournament problem (BTTP) addresses inter-league sports scheduling, which aims to design a feasible bipartite tournament between two $n$-team leagues under some constraints such that the total traveling distance of all participating teams is minimized. Since its introduction, several methods have been developed to design feasible schedules for NBA, NPB and so on. In terms o…
▽ More
The bipartite traveling tournament problem (BTTP) addresses inter-league sports scheduling, which aims to design a feasible bipartite tournament between two $n$-team leagues under some constraints such that the total traveling distance of all participating teams is minimized. Since its introduction, several methods have been developed to design feasible schedules for NBA, NPB and so on. In terms of solution quality with a theoretical guarantee, previously only a $(2+\varepsilon)$-approximation is known for the case that $n\equiv 0 \pmod 3$. Whether there are similar results for the cases that $n\equiv 1 \pmod 3$ and $n\equiv 2 \pmod 3$ was asked in the literature. In this paper, we answer this question positively by proposing a $(3/2+\varepsilon)$-approximation algorithm for any $n$ and any constant $\varepsilon>0$, which also improves the previous approximation ratio for the case that $n\equiv 0 \pmod 3$.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning
Authors:
Hang Gao,
Chenhao Zhang,
Tie Wang,
Junsuo Zhao,
Fengge Wu,
Changwen Zheng,
Huaping Liu
Abstract:
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and…
▽ More
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Unsupervised Anomaly Detection for Autonomous Robots via Mahalanobis SVDD with Audio-IMU Fusion
Authors:
Yizhuo Yang,
Jiulin Zhao,
Xinhang Xu,
Kun Cao,
Shenghai Yuan,
Lihua Xie
Abstract:
Reliable anomaly detection is essential for ensuring the safety of autonomous robots, particularly when conventional detection systems based on vision or LiDAR become unreliable in adverse or unpredictable conditions. In such scenarios, alternative sensing modalities are needed to provide timely and robust feedback. To this end, we explore the use of audio and inertial measurement unit (IMU) senso…
▽ More
Reliable anomaly detection is essential for ensuring the safety of autonomous robots, particularly when conventional detection systems based on vision or LiDAR become unreliable in adverse or unpredictable conditions. In such scenarios, alternative sensing modalities are needed to provide timely and robust feedback. To this end, we explore the use of audio and inertial measurement unit (IMU) sensors to detect underlying anomalies in autonomous mobile robots, such as collisions and internal mechanical faults. Furthermore, to address the challenge of limited labeled anomaly data, we propose an unsupervised anomaly detection framework based on Mahalanobis Support Vector Data Description (M-SVDD). In contrast to conventional SVDD methods that rely on Euclidean distance and assume isotropic feature distributions, our approach employs the Mahalanobis distance to adaptively scale feature dimensions and capture inter-feature correlations, enabling more expressive decision boundaries. In addition, a reconstruction-based auxiliary branch is introduced to preserve feature diversity and prevent representation collapse, further enhancing the robustness of anomaly detection. Extensive experiments on a collected mobile robot dataset and four public datasets demonstrate the effectiveness of the proposed method, as shown in the video https://youtu.be/yh1tn6DDD4A. Code and dataset are available at https://github.com/jamesyang7/M-SVDD.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models
Authors:
Mengjun Yi,
Hanwen Zhang,
Hui Dou,
Jian Zhao,
Furao Shen
Abstract:
Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data…
▽ More
Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
Authors:
Xuyang Guo,
Jiayan Huo,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang,
Jiale Zhao
Abstract:
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mat…
▽ More
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
Authors:
Yehui Tang,
Yichun Yin,
Yaoyuan Wang,
Hang Zhou,
Yu Pan,
Wei Guo,
Ziyang Zhang,
Miao Rang,
Fangcheng Liu,
Naifu Zhang,
Binghan Li,
Yonghan Dong,
Xiaojun Meng,
Yasheng Wang,
Dong Li,
Yin Li,
Dandan Tu,
Can Chen,
Youliang Yan,
Fisher Yu,
Ruiming Tang,
Yunhe Wang,
Botian Huang,
Bo Wang,
Boxiao Liu
, et al. (49 additional authors not shown)
Abstract:
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing r…
▽ More
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Authors:
Shenglan Li,
Rui Yao,
Yong Zhou,
Hancheng Zhu,
Kunyang Sun,
Bing Liu,
Zhiwen Shao,
Jiaqi Zhao
Abstract:
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper,…
▽ More
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, and thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Physics-inspired Energy Transition Neural Network for Sequence Learning
Authors:
Zhou Wu,
Junyi An,
Baile Xu,
Furao Shen,
Jian Zhao
Abstract:
Recently, the superior performance of Transformers has made them a more robust and scalable solution for sequence modeling than traditional recurrent neural networks (RNNs). However, the effectiveness of Transformer in capturing long-term dependencies is primarily attributed to their comprehensive pair-modeling process rather than inherent inductive biases toward sequence semantics. In this study,…
▽ More
Recently, the superior performance of Transformers has made them a more robust and scalable solution for sequence modeling than traditional recurrent neural networks (RNNs). However, the effectiveness of Transformer in capturing long-term dependencies is primarily attributed to their comprehensive pair-modeling process rather than inherent inductive biases toward sequence semantics. In this study, we explore the capabilities of pure RNNs and reassess their long-term learning mechanisms. Inspired by the physics energy transition models that track energy changes over time, we propose a effective recurrent structure called the``Physics-inspired Energy Transition Neural Network" (PETNN). We demonstrate that PETNN's memory mechanism effectively stores information over long-term dependencies. Experimental results indicate that PETNN outperforms transformer-based methods across various sequence tasks. Furthermore, owing to its recurrent nature, PETNN exhibits significantly lower complexity. Our study presents an optimal foundational recurrent architecture and highlights the potential for developing effective recurrent neural networks in fields currently dominated by Transformer.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Interactive Instance Annotation with Siamese Networks
Authors:
Xiang Xu,
Ruotong Li,
Mengjun Yi,
Baile XU,
Furao Shen,
Jian Zhao
Abstract:
Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tra…
▽ More
Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tracking. SiamAnno leverages one-shot learning to annotate previously unseen objects by taking a bounding box as input and predicting object boundaries, which can then be adjusted by annotators. Trained on one dataset and tested on another without fine-tuning, SiamAnno achieves state-of-the-art (SOTA) performance across multiple datasets, demonstrating its ability to handle domain and environment shifts in cross-domain tasks. We also provide more comprehensive results compared to previous work, establishing a strong baseline for future research. To our knowledge, SiamAnno is the first model to explore Siamese architecture for instance annotation.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Hardware vs. Software Implementation of Warp-Level Features in Vortex RISC-V GPU
Authors:
Huanzhi Pu,
Rishabh Ravi,
Shinnung Jeong,
Udit Subramanya,
Euijun Chung,
Jisheng Zhao,
Chihyo Ahn,
Hyesoon Kim
Abstract:
RISC-V GPUs present a promising path for supporting GPU applications. Traditionally, GPUs achieve high efficiency through the SPMD (Single Program Multiple Data) programming model. However, modern GPU programming increasingly relies on warp-level features, which diverge from the conventional SPMD paradigm. In this paper, we explore how RISC-V GPUs can support these warp-level features both through…
▽ More
RISC-V GPUs present a promising path for supporting GPU applications. Traditionally, GPUs achieve high efficiency through the SPMD (Single Program Multiple Data) programming model. However, modern GPU programming increasingly relies on warp-level features, which diverge from the conventional SPMD paradigm. In this paper, we explore how RISC-V GPUs can support these warp-level features both through hardware implementation and via software-only approaches. Our evaluation shows that a hardware implementation achieves up to 4 times geomean IPC speedup in microbenchmarks, while software-based solutions provide a viable alternative for area-constrained scenarios.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Authors:
Inclusion AI,
Biao Gong,
Cheng Zou,
Dandan Zheng,
Hu Yu,
Jingdong Chen,
Jianxin Sun,
Junbo Zhao,
Jun Zhou,
Kaixiang Ji,
Lixiang Ru,
Libin Wang,
Qingpei Guo,
Rui Liu,
Weilong Chai,
Xinyu Xiao,
Ziyuan Huang
Abstract:
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale repr…
▽ More
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones - such as ChatGPT-4o with native image generation updated in March 25, 2025 - underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.
△ Less
Submitted 7 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering
Authors:
Jihao Zhao,
Chunlai Zhou,
Biao Qin
Abstract:
The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational co…
▽ More
The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to assist them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baseline in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation
Authors:
Di Zhang,
Chengbo Yuan,
Chuan Wen,
Hai Zhang,
Junqiao Zhao,
Yang Gao
Abstract:
Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisi…
▽ More
Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisition of high-fidelity tactile data. To mitigate this issue, we propose KineDex, a hand-over-hand kinesthetic teaching paradigm in which the operator's motion is directly transferred to the dexterous hand, enabling the collection of physically grounded demonstrations enriched with accurate tactile feedback. To resolve occlusions from human hand, we apply inpainting technique to preprocess the visual observations. Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. We evaluate KineDex on a suite of challenging contact-rich manipulation tasks, including particularly difficult scenarios such as squeezing toothpaste onto a toothbrush, which require precise multi-finger coordination and stable force regulation. Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. Comparative experiments with teleoperation and user studies further validate the advantages of KineDex in data collection efficiency and operability. Specifically, KineDex collects data over twice as fast as teleoperation across two tasks of varying difficulty, while maintaining a near-100% success rate, compared to under 50% for teleoperation.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Adaptive Branch-and-Bound Tree Exploration for Neural Network Verification
Authors:
Kota Fukuda,
Guanqin Zhang,
Zhenya Zhang,
Yulei Sui,
Jianjun Zhao
Abstract:
Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state-of-the-art that performs verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-prob…
▽ More
Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state-of-the-art that performs verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-problems that ignores the \emph{importance} of different sub-problems. To bridge this gap, we first introduce a notion of ``importance'' that reflects how likely a counterexample can be found with a sub-problem, and then we devise a novel verification approach, called ABONN, that explores the sub-problem space of BaB adaptively, in a Monte-Carlo tree search (MCTS) style. The exploration is guided by the ``importance'' of different sub-problems, so it favors the sub-problems that are more likely to find counterexamples. As soon as it finds a counterexample, it can immediately terminate; even though it cannot find, after visiting all the sub-problems, it can still manage to verify the problem. We evaluate ABONN with 552 verification problems from commonly-used datasets and neural network models, and compare it with the state-of-the-art verifiers as baseline approaches. Experimental evaluation shows that ABONN demonstrates speedups of up to $15.2\times$ on MNIST and $24.7\times$ on CIFAR-10. We further study the influences of hyperparameters to the performance of ABONN, and the effectiveness of our adaptive tree exploration.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Authors:
Xuyang Guo,
Jiayan Huo,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang,
Jiale Zhao
Abstract:
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic…
▽ More
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce \textbf{T2VPhysBench}, a first-principled benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
Authors:
DiJia Su,
Andrew Gu,
Jane Xu,
Yuandong Tian,
Jiawei Zhao
Abstract:
Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspec…
▽ More
Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies
Authors:
Jialiang Zhao,
Naveen Kuppuswamy,
Siyuan Feng,
Benjamin Burchfiel,
Edward Adelson
Abstract:
Achieving robust dexterous manipulation in unstructured domestic environments remains a significant challenge in robotics. Even with state-of-the-art robot learning methods, haptic-oblivious control strategies (i.e. those relying only on external vision and/or proprioception) often fall short due to occlusions, visual complexities, and the need for precise contact interaction control. To address t…
▽ More
Achieving robust dexterous manipulation in unstructured domestic environments remains a significant challenge in robotics. Even with state-of-the-art robot learning methods, haptic-oblivious control strategies (i.e. those relying only on external vision and/or proprioception) often fall short due to occlusions, visual complexities, and the need for precise contact interaction control. To address these limitations, we introduce PolyTouch, a novel robot finger that integrates camera-based tactile sensing, acoustic sensing, and peripheral visual sensing into a single design that is compact and durable. PolyTouch provides high-resolution tactile feedback across multiple temporal scales, which is essential for efficiently learning complex manipulation tasks. Experiments demonstrate an at least 20-fold increase in lifespan over commercial tactile sensors, with a design that is both easy to manufacture and scalable. We then use this multi-modal tactile feedback along with visuo-proprioceptive observations to synthesize a tactile-diffusion policy from human demonstrations; the resulting contact-aware control policy significantly outperforms haptic-oblivious policies in multiple contact-aware manipulation policies. This paper highlights how effectively integrating multi-modal contact sensing can hasten the development of effective contact-aware manipulation policies, paving the way for more reliable and versatile domestic robots. More information can be found at https://polytouch.alanz.info/
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
Authors:
Yi Lu,
Wanxu Zhao,
Xin Zhou,
Chenxin An,
Chenglong Wang,
Shuo Li,
Yuming Yang,
Jun Zhao,
Tao Ji,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Mani…
▽ More
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves the models' performance within training length, such as Llama3.1 70B, by over 18 points on popular long-context benchmarks RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
A Microgravity Simulation Experimental Platform For Small Space Robots In Orbit
Authors:
Hang Luo,
Nanlin Zhou,
Haoxiang Zhang,
Kai Han,
Ning Zhao,
Zhiyuan Yang,
Jian Qi,
Sikai Zhao,
Jie Zhao,
Yanhe Zhu
Abstract:
This study describes the development and validation of a novel microgravity experimental platform that is mainly applied to small robots such as modular self-reconfigurable robots. This platform mainly consists of an air supply system, a microporous platform and glass. By supplying air to the microporous platform to form an air film, the influence of the weight of the air foot and the ventilation…
▽ More
This study describes the development and validation of a novel microgravity experimental platform that is mainly applied to small robots such as modular self-reconfigurable robots. This platform mainly consists of an air supply system, a microporous platform and glass. By supplying air to the microporous platform to form an air film, the influence of the weight of the air foot and the ventilation hose of traditional air-float platforms on microgravity experiments is solved. The contribution of this work is to provide a platform with less external interference for microgravity simulation experiments on small robots.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
A Cognitive-Mechanistic Human Reliability Analysis Framework: A Nuclear Power Plant Case Study
Authors:
Xingyu Xiao,
Peng Chen,
Jiejuan Tong,
Shunshun Liu,
Hongru Zhao,
Jun Zhao,
Qianqian Jia,
Jingang Liang,
Haitao Wang
Abstract:
Traditional human reliability analysis (HRA) methods, such as IDHEAS-ECA, rely on expert judgment and empirical rules that often overlook the cognitive underpinnings of human error. Moreover, conducting human-in-the-loop experiments for advanced nuclear power plants is increasingly impractical due to novel interfaces and limited operational data. This study proposes a cognitive-mechanistic framewo…
▽ More
Traditional human reliability analysis (HRA) methods, such as IDHEAS-ECA, rely on expert judgment and empirical rules that often overlook the cognitive underpinnings of human error. Moreover, conducting human-in-the-loop experiments for advanced nuclear power plants is increasingly impractical due to novel interfaces and limited operational data. This study proposes a cognitive-mechanistic framework (COGMIF) that enhances the IDHEAS-ECA methodology by integrating an ACT-R-based human digital twin (HDT) with TimeGAN-augmented simulation. The ACT-R model simulates operator cognition, including memory retrieval, goal-directed procedural reasoning, and perceptual-motor execution, under high-fidelity scenarios derived from a high-temperature gas-cooled reactor (HTGR) simulator. To overcome the resource constraints of large-scale cognitive modeling, TimeGAN is trained on ACT-R-generated time-series data to produce high-fidelity synthetic operator behavior datasets. These simulations are then used to drive IDHEAS-ECA assessments, enabling scalable, mechanism-informed estimation of human error probabilities (HEPs). Comparative analyses with SPAR-H and sensitivity assessments demonstrate the robustness and practical advantages of the proposed COGMIF. Finally, procedural features are mapped onto a Bayesian network to quantify the influence of contributing factors, revealing key drivers of operational risk. This work offers a credible and computationally efficient pathway to integrate cognitive theory into industrial HRA practices.
△ Less
Submitted 5 May, 2025; v1 submitted 24 April, 2025;
originally announced April 2025.
-
ClassComet: Exploring and Designing AI-generated Danmaku in Educational Videos to Enhance Online Learning
Authors:
Zipeng Ji,
Pengcheng An,
Jian Zhao
Abstract:
Danmaku, users' live comments synchronized with, and overlaying on videos, has recently shown potential in promoting online video-based learning. However, user-generated danmaku can be scarce-especially in newer or less viewed videos and its quality is unpredictable, limiting its educational impact. This paper explores how large multimodal models (LMM) can be leveraged to automatically generate ef…
▽ More
Danmaku, users' live comments synchronized with, and overlaying on videos, has recently shown potential in promoting online video-based learning. However, user-generated danmaku can be scarce-especially in newer or less viewed videos and its quality is unpredictable, limiting its educational impact. This paper explores how large multimodal models (LMM) can be leveraged to automatically generate effective, high-quality danmaku. We first conducted a formative study to identify the desirable characteristics of content- and emotion-related danmaku in educational videos. Based on the obtained insights, we developed ClassComet, an educational video platform with novel LMM-driven techniques for generating relevant types of danmaku to enhance video-based learning. Through user studies, we examined the quality of generated danmaku and their influence on learning experiences. The results indicate that our generated danmaku is comparable to human-created ones, and videos with both content- and emotion-related danmaku showed significant improvement in viewers' engagement and learning outcome.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
Authors:
Zhuang Yu,
Shiliang Sun,
Jing Zhao,
Tengfei Song,
Hao Yang
Abstract:
Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study o…
▽ More
Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE dataset across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention
Authors:
Xiang Hu,
Jiaqi Leng,
Jun Zhao,
Kewei Tu,
Wei Wu
Abstract:
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarc…
▽ More
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selecting the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Algorithmic Mirror: Designing an Interactive Tool to Promote Self-Reflection for YouTube Recommendations
Authors:
Yui Kondo,
Kevin Dunnell,
Qing Xiao,
Jun Zhao,
Luc Rocher
Abstract:
Big Data analytics and Artificial Intelligence systems derive non-intuitive and often unverifiable inferences about individuals' behaviors, preferences, and private lives. Drawing on diverse, feature-rich datasets of unpredictable value, these systems erode the intuitive connection between our actions and how we are perceived, diminishing control over our digital identities. While Explainable Arti…
▽ More
Big Data analytics and Artificial Intelligence systems derive non-intuitive and often unverifiable inferences about individuals' behaviors, preferences, and private lives. Drawing on diverse, feature-rich datasets of unpredictable value, these systems erode the intuitive connection between our actions and how we are perceived, diminishing control over our digital identities. While Explainable Artificial Intelligence scholars have attempted to explain the inner workings of algorithms, their visualizations frequently overwhelm end-users with complexity. This research introduces 'hypothetical inference', a novel approach that uses language models to simulate how algorithms might interpret users' digital footprints and infer personal characteristics without requiring access to proprietary platform algorithms. Through empirical studies with fourteen adult participants, we identified three key design opportunities to foster critical algorithmic literacy: (1) reassembling scattered digital footprints into a unified map, (2) simulating algorithmic inference through LLM-generated interpretations, and (3) incorporating temporal dimensions to visualize evolving patterns. This research lays the groundwork for tools that can help users recognize the influence of data on platforms and develop greater autonomy in increasingly algorithm-mediated digital environments.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Universal Online Contention Resolution with Preselected Order
Authors:
Junyao Zhao
Abstract:
Online contention resolution scheme (OCRS) is a powerful technique for online decision making, which--in the case of matroids--given a matroid and a prior distribution of active elements, selects a subset of active elements that satisfies the matroid constraint in an online fashion. OCRS has been studied mostly for product distributions in the literature. Recently, universal OCRS, that works even…
▽ More
Online contention resolution scheme (OCRS) is a powerful technique for online decision making, which--in the case of matroids--given a matroid and a prior distribution of active elements, selects a subset of active elements that satisfies the matroid constraint in an online fashion. OCRS has been studied mostly for product distributions in the literature. Recently, universal OCRS, that works even for correlated distributions, has gained interest, because it naturally generalizes the classic notion, and its existence in the random-order arrival model turns out to be equivalent to the matroid secretary conjecture. However, currently very little is known about how to design universal OCRSs for any arrival model. In this work, we consider a natural and relatively flexible arrival model, where the OCRS is allowed to preselect (i.e., non-adaptively select) the arrival order of the elements, and within this model, we design simple and optimal universal OCRSs that are computationally efficient. In the course of deriving our OCRSs, we also discover an efficient reduction from universal online contention resolution to the matroid secretary problem for any arrival model, answering a question from Dughmi (2020).
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Exploiting Contextual Knowledge in LLMs through V-usable Information based Layer Enhancement
Authors:
Xiaowei Yuan,
Zhao Yang,
Ziyang Huang,
Yequan Wang,
Siqi Fan,
Yiming Ju,
Jun Zhao,
Kang Liu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithfulness generations that properly reflect contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs' internal states. As a result, LLMs remain…
▽ More
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithfulness generations that properly reflect contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs' internal states. As a result, LLMs remain limited in their ability to fully leverage contextual knowledge. In this paper, we propose Context-aware Layer Enhancement (CaLE), a novel intervention method that enhances the utilization of contextual knowledge within LLMs' internal representations. By employing V-usable information analysis, CaLE strategically amplifies the growth of contextual information at an optimal layer, thereby enriching representations in the final layer. Our experiments demonstrate that CaLE effectively improves context-faithful generation in Question-Answering tasks, particularly in scenarios involving unknown or conflicting contextual knowledge.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture
Authors:
Meng Cui,
Xianghu Yue,
Xinyuan Qian,
Jinzheng Zhao,
Haohe Liu,
Xubo Liu,
Daoliang Li,
Wenwu Wang
Abstract:
Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce…
▽ More
Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce AV-CIL-FFIA, a new dataset comprising 81,932 labelled audio-visual clips capturing feeding intensities across six different fish species in real aquaculture environments. Then, we pioneer audio-visual class incremental learning (CIL) for FFIA and demonstrate through benchmarking on AV-CIL-FFIA that it significantly outperforms single-modality methods. Existing CIL methods rely heavily on historical data. Exemplar-based approaches store raw samples, creating storage challenges, while exemplar-free methods avoid data storage but struggle to distinguish subtle feeding intensity variations across different fish species. To overcome these limitations, we introduce HAIL-FFIA, a novel audio-visual class-incremental learning framework that bridges this gap with a prototype-based approach that achieves exemplar-free efficiency while preserving essential knowledge through compact feature representations. Specifically, HAIL-FFIA employs hierarchical representation learning with a dual-path knowledge preservation mechanism that separates general intensity knowledge from fish-specific characteristics. Additionally, it features a dynamic modality balancing system that adaptively adjusts the importance of audio versus visual information based on feeding behaviour stages. Experimental results show that HAIL-FFIA is superior to SOTA methods on AV-CIL-FFIA, achieving higher accuracy with lower storage needs while effectively mitigating catastrophic forgetting in incremental fish species learning.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
Authors:
Jinghua Zhao,
Yuhang Jia,
Shiyao Wang,
Jiaming Zhou,
Hui Wang,
Yong Qin
Abstract:
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we r…
▽ More
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8\% and 25\%, respectively, with a combined performance improvement of about 35\%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Understanding Accuracy-Fairness Trade-offs in Re-ranking through Elasticity in Economics
Authors:
Chen Xu,
Jujia Zhao,
Wenjie Wang,
Liang Pang,
Jun Xu,
Tat-Seng Chua,
Maarten de Rijke
Abstract:
Fairness is an increasingly important factor in re-ranking tasks. Prior work has identified a trade-off between ranking accuracy and item fairness. However, the underlying mechanisms are still not fully understood. An analogy can be drawn between re-ranking and the dynamics of economic transactions. The accuracy-fairness trade-off parallels the coupling of the commodity tax transfer process. Fairn…
▽ More
Fairness is an increasingly important factor in re-ranking tasks. Prior work has identified a trade-off between ranking accuracy and item fairness. However, the underlying mechanisms are still not fully understood. An analogy can be drawn between re-ranking and the dynamics of economic transactions. The accuracy-fairness trade-off parallels the coupling of the commodity tax transfer process. Fairness considerations in re-ranking, similar to a commodity tax on suppliers, ultimately translate into a cost passed on to consumers. Analogously, item-side fairness constraints result in a decline in user-side accuracy. In economics, the extent to which commodity tax on the supplier (item fairness) transfers to commodity tax on users (accuracy loss) is formalized using the notion of elasticity. The re-ranking fairness-accuracy trade-off is similarly governed by the elasticity of utility between item groups. This insight underscores the limitations of current fair re-ranking evaluations, which often rely solely on a single fairness metric, hindering comprehensive assessment of fair re-ranking algorithms. Centered around the concept of elasticity, this work presents two significant contributions. We introduce the Elastic Fairness Curve (EF-Curve) as an evaluation framework. This framework enables a comparative analysis of algorithm performance across different elasticity levels, facilitating the selection of the most suitable approach. Furthermore, we propose ElasticRank, a fair re-ranking algorithm that employs elasticity calculations to adjust inter-item distances within a curved space. Experiments on three widely used ranking datasets demonstrate its effectiveness and efficiency.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation
Authors:
Jiajun Shen,
Tong Zhou,
Yubo Chen,
Delai Qiu,
Shengping Liu,
Kang Liu,
Jun Zhao
Abstract:
While hallucinations of large language models could been alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce Context-Prior Augmented Citation Generation task, requiring models to generate citations considering both exter…
▽ More
While hallucinations of large language models could been alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce Context-Prior Augmented Citation Generation task, requiring models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves a better cross-scenario performance with regard to other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on the trustworthiness in citation generation.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Accelerating Visual Reinforcement Learning with Separate Primitive Policy for Peg-in-Hole Tasks
Authors:
Zichun Xu,
Zhaomin Wang,
Yuntao Li,
Lei Zhuang,
Zhiyuan Zhao,
Guocai Yang,
Jingdong Zhao
Abstract:
For peg-in-hole tasks, humans rely on binocular visual perception to locate the peg above the hole surface and then proceed with insertion. This paper draws insights from this behavior to enable agents to learn efficient assembly strategies through visual reinforcement learning. Hence, we propose a Separate Primitive Policy (S2P) to simultaneously learn how to derive location and insertion actions…
▽ More
For peg-in-hole tasks, humans rely on binocular visual perception to locate the peg above the hole surface and then proceed with insertion. This paper draws insights from this behavior to enable agents to learn efficient assembly strategies through visual reinforcement learning. Hence, we propose a Separate Primitive Policy (S2P) to simultaneously learn how to derive location and insertion actions. S2P is compatible with model-free reinforcement learning algorithms. Ten insertion tasks featuring different polygons are developed as benchmarks for evaluations. Simulation experiments show that S2P can boost the sample efficiency and success rate even with force constraints. Real-world experiments are also performed to verify the feasibility of S2P. Ablations are finally given to discuss the generalizability of S2P and some factors that affect its performance.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results
Authors:
Zheng Chen,
Kai Liu,
Jue Gong,
Jingkai Wang,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yulun Zhang,
Xiangyu Kong,
Xiaoxuan Yu,
Hyunhee Park,
Suejin Han,
Hakjae Jeon,
Dafeng Zhang,
Hyung-Ju Chun,
Donghun Ryou,
Inju Ha,
Bohyung Han,
Lu Zhao,
Yuyi Zhang,
Pengyu Yan,
Jiawei Hu,
Pengwei Liu,
Fengjun Guo,
Hongyuan Yu
, et al. (86 additional authors not shown)
Abstract:
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…
▽ More
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
△ Less
Submitted 28 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
AnywhereXR: On-the-fly 3D Environments as a Basis for Open Source Immersive Digital Twin Applications
Authors:
Alexander Klippel,
Bart Knuiman,
Jiayan Zhao,
Jan Oliver Wallgrün,
Jascha Grübel
Abstract:
Visualization has long been fundamental to human communication and decision-making. Today, we stand at the threshold of integrating veridical, high-fidelity visualizations into immersive digital environments, alongside digital twinning techniques. This convergence heralds powerful tools for communication, co-design, and participatory decision-making. Our paper delves into the development of lightw…
▽ More
Visualization has long been fundamental to human communication and decision-making. Today, we stand at the threshold of integrating veridical, high-fidelity visualizations into immersive digital environments, alongside digital twinning techniques. This convergence heralds powerful tools for communication, co-design, and participatory decision-making. Our paper delves into the development of lightweight open-source immersive digital twin visualisations, capitalizing on the evolution of immersive technologies, the wealth of spatial data available, and advancements in digital twinning. Coined AnywhereXR, this approach ultimately seeks to democratize access to spatial information at a global scale. Utilizing the Netherlands as our starting point, we envision expanding this methodology worldwide, leveraging open data and software to address pressing societal challenges across diverse domains.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
LoRA-Based Continual Learning with Constraints on Critical Parameter Changes
Authors:
Shimou Ling,
Liang Zhang,
Jiangwei Zhao,
Lili Pan,
Hongliang Li
Abstract:
LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we di…
▽ More
LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we directly propose freezing the most critical parameter matrices in the Vision Transformer (ViT) for pre-tasks before learning post-tasks. In addition, building on orthogonal LoRA tuning, we propose orthogonal LoRA composition (LoRAC) based on QR decomposition, which may further enhance the plasticity of our method. Elaborate ablation studies and extensive comparisons demonstrate the effectiveness of our proposed method. Our results indicate that our method achieves state-of-the-art (SOTA) performance on several well-known continual learning benchmarks. For instance, on the Split CIFAR-100 dataset, our method shows a 6.35\% improvement in accuracy and a 3.24\% reduction in forgetting compared to previous methods. Our code is available at https://github.com/learninginvision/LoRAC-IPC.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception
Authors:
Jiancheng Zhao,
Xiang Ji,
Zhuoxiao Li,
Zunian Wan,
Weihang Ran,
Mingze Ma,
Muyao Niu,
Yifan Zhan,
Cheng-Ching Tseng,
Yinqiang Zheng
Abstract:
Efficiently transferring Learned Image Compression (LIC) model from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an…
▽ More
Efficiently transferring Learned Image Compression (LIC) model from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both Fully Fine-Tuned and other Parameter Efficient Fine-Tuned (PEFT) baselines, and validating the effectiveness of multi-vision transferring.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding
Authors:
Jiancheng Zhao,
Yifan Zhan,
Qingtian Zhu,
Mingze Ma,
Muyao Niu,
Zunian Wan,
Xiang Ji,
Yinqiang Zheng
Abstract:
Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we pro…
▽ More
Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.