-
Analogous supercritical crossovers in black holes and water
Authors:
Shoucheng Wang,
Xinyang Li,
Yuliang Jin,
Li Li
Abstract:
We investigate the supercritical crossovers for black hole thermodynamics in the supercritical regime beyond the critical point, where small and large black holes are indistinguishable from the conventional viewpoint. We establish a refined supercritical phase diagram that comprehensively characterizes small, large, and indistinguishable black hole phases, whose boundaries are defined by two supercritical crossover lines. The universal scaling laws of the two crossover lines are fully verified using black hole thermodynamics in both the standard consideration and the extended thermodynamic phase space by treating the cosmological constant as a thermodynamic pressure. Remarkable analogies are observed when the supercritical phase diagrams of the two frameworks of black holes are compared to those corresponding to liquid-gas and liquid-liquid phase transitions. The present study can be extended to a variety of more complicated black hole backgrounds and provide valuable insights into the fundamental nature of black hole thermodynamics.
Submitted 12 June, 2025;
originally announced June 2025.
-
Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Authors:
Jingguo Qu,
Xinyang Han,
Tonghuan Xiao,
Jia Ai,
Juan Wu,
Tong Zhao,
Jing Qin,
Ann Dorothy King,
Winnie Chiu-Wing Chu,
Jing Cai,
Michael Tin-Cheung Ying
Abstract:
Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore a fine-tuning pipeline for vision-language foundation models that uses a large language model as a text refiner, together with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method effectively improves the performance of vision-language foundation models for ultrasound image analysis and outperforms existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at https://github.com/jinggqu/NextGen-UIA.
Submitted 10 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
Authors:
Weizhi Zhang,
Xinyang Zhang,
Chenwei Zhang,
Liangwei Yang,
Jingbo Shang,
Zhepei Wei,
Henry Peng Zou,
Zijie Huang,
Zhengyang Wang,
Yifan Gao,
Xiaoman Pan,
Lian Xiong,
Jingguo Liu,
Philip S. Yu,
Xian Li
Abstract:
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.
Submitted 6 June, 2025;
originally announced June 2025.
-
Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router
Authors:
Chenyang Shao,
Xinyang Liu,
Yutang Lin,
Fengli Xu,
Yong Li
Abstract:
Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time through deeper thought processes or broader exploration can further improve performance, but often incurs substantial costs due to the explosion in token usage. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing subtasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with the Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at https://anonymous.4open.science/r/R2_Reasoner.
Submitted 6 June, 2025;
originally announced June 2025.
-
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
Authors:
Zichen Tang,
Haihong E,
Ziyan Ma,
Haoyang He,
Jiacheng Liu,
Zhongjun Yang,
Zihua Rong,
Rongjin Li,
Kun Ji,
Qing Huang,
Xinyang Hu,
Yang Liu,
Qianhe Zheng
Abstract:
We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhance LRMs' financial reasoning capabilities through refined knowledge (e.g., 83.2% $\rightarrow$ 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs' performance (e.g., 83.2% $\rightarrow$ 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.
Submitted 6 June, 2025;
originally announced June 2025.
-
Universal Reusability in Recommender Systems: The Case for Dataset- and Task-Independent Frameworks
Authors:
Tri Kurniawan Wijaya,
Xinyang Shao,
Gonzalo Fiz Pontiveros,
Edoardo D'Amico
Abstract:
Recommender systems are pivotal in delivering personalized experiences across industries, yet their adoption and scalability remain hindered by the need for extensive dataset- and task-specific configurations. Existing systems often require significant manual intervention, domain expertise, and engineering effort to adapt to new datasets or tasks, creating barriers to entry and limiting reusability. In contrast, recent advancements in large language models (LLMs) have demonstrated the transformative potential of reusable systems, where a single model can handle diverse tasks without significant reconfiguration. Inspired by this paradigm, we propose the Dataset- and Task-Independent Recommender System (DTIRS), a framework aimed at maximizing the reusability of recommender systems while minimizing barriers to entry. Unlike LLMs, which achieve task generalization directly, DTIRS focuses on eliminating the need to rebuild or reconfigure recommendation pipelines for every new dataset or task, even though models may still need retraining on new data. By leveraging the novel Dataset Description Language (DsDL), DTIRS enables standardized dataset descriptions and explicit task definitions, allowing autonomous feature engineering, model selection, and optimization. This paper introduces the concept of DTIRS and establishes a roadmap for transitioning from Level-1 automation (dataset-agnostic but task-specific systems) to Level-2 automation (fully dataset- and task-independent systems). Achieving this paradigm would maximize code reusability and lower barriers to adoption. We discuss key challenges, including the trade-offs between generalization and specialization, computational overhead, and scalability, while presenting DsDL as a foundational tool for this vision.
Submitted 3 June, 2025;
originally announced June 2025.
-
Voyager: Real-Time Splatting City-Scale 3D Gaussians on Your Phone
Authors:
Zheng Liu,
He Zhu,
Xinyang Li,
Yirun Wang,
Yujiao Shi,
Wei Li,
Jingwen Leng,
Minyi Guo,
Yu Feng
Abstract:
3D Gaussian Splatting (3DGS) is an emerging technique for photorealistic 3D scene rendering. However, rendering city-scale 3DGS scenes on mobile devices such as smartphones remains a significant challenge due to their limited resources. A natural solution is to offload computation to the cloud; however, naively streaming rendered frames from the cloud to the client introduces high latency and requires bandwidth far beyond the capacity of current wireless networks.
In this paper, we propose an effective solution to enable city-scale 3DGS rendering on mobile devices. Our key insight is that, under normal user motion, the number of newly visible Gaussians per second remains roughly constant. Leveraging this, we stream only the necessary Gaussians to the client. Specifically, on the cloud side, we propose an asynchronous level-of-detail search to identify the necessary Gaussians for the client. On the client side, we accelerate rendering via lookup table-based rasterization. Combined with holistic runtime optimizations, our system can deliver low-latency, city-scale 3DGS rendering on mobile devices. Compared to existing solutions, Voyager achieves over 100$\times$ reduction in data transfer and up to 8.9$\times$ speedup while retaining comparable rendering quality.
Submitted 3 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
Weak but influential: Nonlinear contributions of structural connectivity to human cognitive abilities and brain functions
Authors:
Rong Wang,
Zhao Chang,
Xuechun Liu,
Daniel Kristanto,
Étienne Gérard Guy Gartner,
Xinyang Liu,
Mianxin Liu,
Ying Wu,
Ming Lui,
Changsong Zhou
Abstract:
Diverse human cognitive abilities are rooted in brain structural connectivity, which has weights spanning several orders of magnitude. However, due to false-positive challenges in tractography, weak connectivity has often been treated as noise and ignored, despite its prevalence across mammalian brains. Here we show that weak connectivity significantly predicts human cognitive abilities and supports brain functions through amplification of its small weight in a nonlinear manner. Using the Human Connectome Project dataset (n=999) and multiple tractography algorithms, we constructed the whole-brain structural connectivity with heterogeneous weights of streamline numbers. We found that weak connectivity involves high individual variability and significantly predicts general cognitive ability and memory in individuals, and it is also critical for whole-brain dynamic simulation and structure-function coupling. Importantly, fusing two post-tractography filtering methods of streamlines potentially results in more reliable connectivity that preserves weak links and outperforms conventional thresholding in predicting cognitive abilities and functional connectivity. At the network level, weak connectivity expands the operational capacity of brain networks to enhance both global integration and fine-grained segregation, thereby supporting a functional balance essential for cognitive abilities. Finally, we identified a specific type of weak connectivity mainly linking visual/motor to limbic areas with negative gene co-expression, which has a disproportionately large impact on cognitive predictions and network dynamics.
Submitted 29 May, 2025;
originally announced May 2025.
-
Diff-FlowFSI: A GPU-Optimized Differentiable CFD Platform for High-Fidelity Turbulence and FSI Simulations
Authors:
Xiantao Fan,
Xinyang Liu,
Meng Wang,
Jian-Xun Wang
Abstract:
Turbulent flows and fluid-structure interactions (FSI) are ubiquitous in scientific and engineering applications, but their accurate and efficient simulation remains a major challenge due to strong nonlinearities, multiscale interactions, and high computational demands. Traditional CFD solvers, though effective, struggle with scalability and adaptability for tasks such as inverse modeling, optimization, and data assimilation. Recent advances in machine learning (ML) have inspired hybrid modeling approaches that integrate neural networks with physics-based solvers to enhance generalization and capture unresolved dynamics. However, realizing this integration requires solvers that are not only physically accurate but also differentiable and GPU-efficient. In this work, we introduce Diff-FlowFSI, a GPU-accelerated, fully differentiable CFD platform designed for high-fidelity turbulence and FSI simulations. Implemented in JAX, Diff-FlowFSI features a vectorized finite volume solver combined with the immersed boundary method to handle complex geometries and fluid-structure coupling. The platform enables GPU-enabled fast forward simulations, supports automatic differentiation for gradient-based inverse problems, and integrates seamlessly with deep learning components for hybrid neural-CFD modeling. We validate Diff-FlowFSI across a series of benchmark turbulence and FSI problems, demonstrating its capability to accelerate scientific computing at the intersection of physics and machine learning.
Submitted 29 May, 2025;
originally announced May 2025.
-
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Authors:
David Ma,
Huaqing Yuan,
Xingjian Wang,
Qianbo Zang,
Tianci Liu,
Xinyang He,
Yanbin Wei,
Jiawei Guo,
Ni Jiahui,
Zhenzhu Yang,
Meng Cao,
Shanghaoran Quan,
Yizhi Li,
Wangchunshu Zhou,
Jiaheng Liu,
Wenhao Huang,
Ge Zhang,
Shiwen Ni,
Xiaojie Jin
Abstract:
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4-8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
Submitted 29 May, 2025;
originally announced May 2025.
-
Enhanced Light Extraction and Beam Focusing in GaN LEDs Using Hybrid Metasurface-Distributed Bragg Reflector Structures
Authors:
Hanbo Xu,
Xinyang Liu,
Lei Wang
Abstract:
This study presents an optimized hybrid design integrating a distributed Bragg reflector (DBR) and a TiO2 nanocylinder metasurface to enhance light extraction efficiency (LEE) and beam directionality (narrow divergence angle) in light-emitting diodes (LEDs) based on gallium nitride (GaN). Parametric simulations were used to identify an optimal device architecture. The resulting structure comprises a single-period DBR, with a TiO2 thickness (dTiO2) of 46 nm and a SiO2 thickness of 77 nm, beneath a periodic array of TiO2 nanocylinders (radius approximately 71 nm, height approximately 185 nm). The DBR reflects guided modes to minimize internal optical losses, while the TiO2 metasurface employs Mie resonance to collimate the emitted light. As a result, the hybrid LED achieves a simulated LEE of 25.67% and a beam divergence angle of only 5.7 degrees, representing a significant improvement in both efficiency and emission directionality over conventional designs. These findings demonstrate a viable strategy to overcome light trapping and broad angular emission in GaN LEDs, paving the way for high-brightness, highly directional GaN micro-LEDs for advanced display and optical communication applications.
Submitted 29 May, 2025;
originally announced May 2025.
-
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding
Authors:
Mengjingcheng Mo,
Xinyang Tong,
Jiaxu Leng,
Mingpi Tan,
Jiankang Zheng,
Yiran Liu,
Haosheng Chen,
Ji Gan,
Weisheng Li,
Xinbo Gao
Abstract:
While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code will be released at https://hayneyday.github.io/A2Seek/.
Submitted 28 May, 2025;
originally announced May 2025.
-
Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science
Authors:
Xiao Liu,
Xinyi Dong,
Xinyang Gao,
Yansong Feng,
Xun Pang
Abstract:
Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas of higher quality. Our work highlights the potential of data-driven research idea generation and underscores the practical utility of LLM-assisted ideation in real-world academic settings.
Submitted 27 May, 2025;
originally announced May 2025.
-
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Authors:
Shuhao Han,
Haotian Fan,
Fangyuan Kong,
Wenjie Liao,
Chunle Guo,
Chongyi Li,
Radu Timofte,
Liang Li,
Tao Li,
Junhui Cui,
Yunqiu Wang,
Yang Tai,
Jingwei Sun,
Jianhui Sun,
Xinli Yue,
Tianyi Wang,
Huan Hou,
Junda Lu,
Xinyang Huang,
Zitang Zhou,
Zijian Zhang,
Xuhui Zheng,
Xuecheng Wu,
Chong Peng,
Xuezhi Cao
, et al. (90 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants have registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
Submitted 22 May, 2025;
originally announced May 2025.
-
TRAIL: Transferable Robust Adversarial Images via Latent diffusion
Authors:
Yuhao Xue,
Zhifei Zhang,
Xinyang Jiang,
Yifei Shen,
Junyao Gao,
Wentao Gu,
Jiale Zhao,
Miaojing Shi,
Cairong Zhao
Abstract:
Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address this challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images that carry adversarial features and closely resemble the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.
Submitted 21 May, 2025;
originally announced May 2025.
-
The Application of Deep Learning for Lymph Node Segmentation: A Systematic Review
Authors:
Jingguo Qu,
Xinyang Han,
Man-Lik Chui,
Yao Pu,
Simon Takadiyi Gunda,
Ziman Chen,
Jing Qin,
Ann Dorothy King,
Winnie Chiu-Wing Chu,
Jing Cai,
Michael Tin-Cheung Ying
Abstract:
Automatic lymph node segmentation is the cornerstone for advances in computer vision tasks for early detection and staging of cancer. Traditional segmentation methods are constrained by manual delineation and variability in operator proficiency, limiting their ability to achieve high accuracy. The introduction of deep learning technologies offers new possibilities for improving the accuracy of lymph node image analysis. This study evaluates the application of deep learning in lymph node segmentation and discusses the methodologies of various deep learning architectures such as convolutional neural networks, encoder-decoder networks, and transformers in analyzing medical imaging data across different modalities. Despite the advancements, the field still confronts challenges like the shape diversity of lymph nodes, the scarcity of accurately labeled datasets, and the inadequate development of methods that are robust and generalizable across different imaging modalities. To the best of our knowledge, this is the first study that provides a comprehensive overview of the application of deep learning techniques to the lymph node segmentation task. Furthermore, this study also explores potential future research directions, including multimodal fusion techniques, transfer learning, and the use of large-scale pre-trained models to overcome current limitations while enhancing cancer diagnosis and treatment planning strategies.
Submitted 9 May, 2025;
originally announced May 2025.
-
WaterDrum: Watermarking for Data-centric Unlearning Metric
Authors:
Xinyang Lu,
Xinyuan Niu,
Gregory Kang Ruey Lau,
Bui Thi Cam Nhung,
Rachael Hwee Ling Sim,
Fanyu Wen,
Chuan-Sheng Foo,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain sets have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
Submitted 8 May, 2025;
originally announced May 2025.
-
Label-efficient Single Photon Images Classification via Active Learning
Authors:
Zili Zhang,
Ziting Wen,
Yiheng Qiang,
Hongzhou Dong,
Wenle Dong,
Xinyang Li,
Xiaofan Wang,
Xiaoqiang Ren
Abstract:
Single-photon LiDAR achieves high-precision 3D imaging in extreme environments through quantum-level photon detection technology. Current research primarily focuses on reconstructing 3D scenes from sparse photon events, whereas the semantic interpretation of single-photon images remains underexplored, due to high annotation costs and inefficient labeling strategies. This paper presents the first active learning framework for single-photon image classification. The core contribution is an imaging condition-aware sampling strategy that integrates synthetic augmentation to model variability across imaging conditions. By identifying samples where the model is both uncertain and sensitive to these conditions, the proposed method selectively annotates only the most informative examples. Experiments on both synthetic and real-world datasets show that our approach outperforms all baselines and achieves high classification accuracy with significantly fewer labeled samples. Specifically, our approach achieves 97% accuracy on synthetic single-photon data using only 1.5% labeled samples. On real-world data, we maintain 90.63% accuracy with just 8% labeled samples, which is 4.51% higher than the best-performing baseline. This illustrates that active learning enables the same level of classification performance on single-photon images as on classical images, opening doors to large-scale integration of single-photon data in real-world applications.
Submitted 7 May, 2025;
originally announced May 2025.
-
OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
Authors:
Can Cui,
Pengxiang Ding,
Wenxuan Song,
Shuanghao Bai,
Xinyang Tong,
Zirui Ge,
Runze Suo,
Wanqi Zhou,
Yang Liu,
Bofang Jia,
Han Zhao,
Siteng Huang,
Donglin Wang
Abstract:
Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper will summarize and compare the structural designs of existing dual-system architectures, and conduct systematic empirical evaluations on their core design elements. Ultimately, it will provide a low-cost open-source model for further exploration. This project will continue to be updated with further experimental conclusions and open-source models with improved performance. Project page: https://openhelix-robot.github.io/.
Submitted 6 May, 2025;
originally announced May 2025.
-
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
Authors:
Yuchen Wang,
Xuefeng Bai,
Xiucheng Li,
Weili Guan,
Liqiang Nie,
Xinyang Chen
Abstract:
Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/
Submitted 4 May, 2025;
originally announced May 2025.
-
Computation of Capacity-Distortion-Cost Functions for Continuous Memoryless Channels
Authors:
Xinyang Li,
Ziyou Tang,
Vlad C. Andrei,
Ullrich J. Mönich,
Fan Liu,
Holger Boche
Abstract:
This paper aims at computing the capacity-distortion-cost (CDC) function for continuous memoryless channels, which is defined as the supremum of the mutual information between channel input and output, constrained by an input cost and an expected distortion of estimating channel state. Solving the optimization problem is challenging because the input distribution does not lie in a finite-dimensional Euclidean space and the optimal estimation function has no closed form in general. We propose to adopt the Wasserstein proximal point method and parametric models such as neural networks (NNs) to update the input distribution and estimation function alternately. To implement it in practice, the importance sampling (IS) technique is used to calculate integrals numerically, and the Wasserstein gradient descent is approximated by pushing forward particles. The algorithm is then applied to an integrated sensing and communications (ISAC) system, validating theoretical results at minimum and maximum distortion as well as the random-deterministic trade-off.
Submitted 28 April, 2025;
originally announced April 2025.
-
SynergyAmodal: Deocclude Anything with Text Control
Authors:
Xinyang Li,
Chengjie Yi,
Jiawei Lai,
Mingbao Lin,
Yansong Qu,
Shengchuan Zhang,
Liujuan Cao
Abstract:
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.
Submitted 28 April, 2025;
originally announced April 2025.
-
Metasurface-Assisted Adaptive Quantum Phase Contrast Imaging
Authors:
Xiaojing Feng,
Juanzi He,
Xingyu Liu,
Xiaoshu Zhu,
Yifan Zhou,
Xinyang Feng,
Shuming Wang
Abstract:
Quantum imaging employs the nonclassical correlation of photons to break through the noise limitation of classical imaging, realizing high-sensitivity, high-SNR imaging and multifunctional image processing. To enhance the flexibility and imaging performance of the optical systems, metasurfaces composed of subwavelength structural units provide a powerful optimization approach, enabling advanced applications in quantum state modulation and high-precision imaging. Conventional phase contrast imaging is fundamentally constrained by its single-phase modulation scheme, precluding adaptive switching between imaging modalities. Therefore, the development of high-contrast imaging techniques that can be used in any combination of phases has been a challenge in the field of optical imaging. Here, we propose a novel imaging scheme combining a polarization-entangled light source and a polarization-multiplexed metasurface, which realizes remotely switchable bright-dark phase contrast imaging, demonstrating the flexibility and high integration of the system. Experiments show the system can realize high-contrast and high-SNR imaging under low phase gradient conditions (phase difference as low as π/5) and exhibit excellent phase resolution. In addition, the system is suitable for imaging biological samples under low-throughput light conditions, providing an efficient and non-destructive solution for biomedical imaging and promoting the development of phase-sensitive imaging technology.
Submitted 26 April, 2025;
originally announced April 2025.
-
Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs
Authors:
Shaohui Dai,
Yansong Qu,
Zheyan Li,
Xinyang Li,
Shengchuan Zhang,
Liujuan Cao
Abstract:
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster. Our code will be available at https://github.com/Atrovast/THGS.
Submitted 17 April, 2025;
originally announced April 2025.
-
Analysis of the MICCAI Brain Tumor Segmentation -- Metastases (BraTS-METS) 2025 Lighthouse Challenge: Brain Metastasis Segmentation on Pre- and Post-treatment MRI
Authors:
Nazanin Maleki,
Raisa Amiruddin,
Ahmed W. Moawad,
Nikolay Yordanov,
Athanasios Gkampenis,
Pascal Fehringer,
Fabian Umeh,
Crystal Chukwurah,
Fatima Memon,
Bojan Petrovic,
Justin Cramer,
Mark Krycia,
Elizabeth B. Shrickel,
Ichiro Ikuta,
Gerard Thompson,
Lorenna Vidal,
Vilma Kosovic,
Adam E. Goldman-Yassen,
Virginia Hill,
Tiffany So,
Sedra Mhana,
Albara Alotaibi,
Nathan Page,
Prisha Bhatia,
Yasaman Sharifi,
et al. (218 additional authors not shown)
Abstract:
Despite continuous advancements in cancer treatment, brain metastatic disease remains a significant complication of primary cancer and is associated with an unfavorable prognosis. One approach for improving diagnosis, management, and outcomes is to implement algorithms based on artificial intelligence for the automated segmentation of both pre- and post-treatment MRI brain images. Such algorithms rely on volumetric criteria for lesion identification and treatment response assessment, which are still not available in clinical practice. Therefore, it is critical to establish tools for rapid volumetric segmentation that can be translated to clinical practice and that are trained on high-quality annotated data. The BraTS-METS 2025 Lighthouse Challenge aims to address this critical need by establishing inter-rater and intra-rater variability in dataset annotation by generating high-quality annotated datasets from four individual instances of segmentation by neuroradiologists while being recorded on video (two instances doing "from scratch" and two instances after AI pre-segmentation). This high-quality annotated dataset will be used for the testing phase of the 2025 Lighthouse Challenge and will be publicly released at the completion of the challenge. The 2025 Lighthouse Challenge will also release the 2023 and 2024 segmented datasets that were annotated using an established pipeline of pre-segmentation, student annotation, two neuroradiologists checking, and one neuroradiologist finalizing the process. It builds upon its previous edition by including post-treatment cases in the dataset. Using these high-quality annotated datasets, the 2025 Lighthouse Challenge plans to test benchmark algorithms for automated segmentation of pre- and post-treatment brain metastases (BM), trained on diverse and multi-institutional datasets of MRI images obtained from patients with brain metastases.
Submitted 6 May, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Authors:
Weixiang Zhao,
Jiahe Guo,
Yulin Hu,
Yang Deng,
An Zhang,
Xingyu Sui,
Xinyang Han,
Yanyan Zhao,
Bing Qin,
Tat-Seng Chua,
Ting Liu
Abstract:
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
Submitted 13 April, 2025;
originally announced April 2025.
-
An LLM-Driven Multi-Agent Debate System for Mendelian Diseases
Authors:
Xinyang Zhou,
Yongyong Ren,
Qianqian Zhao,
Daoyi Huang,
Xinbo Wang,
Tingting Zhao,
Zhixing Zhu,
Wenyuan He,
Shuyuan Li,
Yan Xu,
Yu Sun,
Yongguo Yu,
Shengnan Wu,
Jian Wang,
Guangjun Yu,
Dake He,
Bo Ban,
Hui Lu
Abstract:
Accurate diagnosis of Mendelian diseases is crucial for precision therapy and assistance in preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-Driven multi-agent debate system (MD2GPS) with natural language explanations of the diagnostic results. It utilizes a language model to transform results from data-driven and knowledge-driven agents into natural language, and then fosters a debate between these two specialized agents. This system has been tested on 1,185 samples across four independent datasets, enhancing the TOP1 accuracy from 42.9% to 66% on average. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients, reducing the diagnostic time by 90%. The methods within each module of this multi-agent debate system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.
Submitted 11 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Strategic learning for disturbance rejection in multi-agent systems: Nash and Minmax in graphical games
Authors:
Xinyang Wang,
Martin Guay,
Shimin Wang,
Hongwei Zhang
Abstract:
This article investigates the optimal control problem with disturbance rejection for discrete-time multi-agent systems under cooperative and non-cooperative graphical games frameworks. Given the practical challenges of obtaining accurate models, Q-function-based policy iteration methods are proposed to seek the Nash equilibrium solution for the cooperative graphical game and the distributed minmax solution for the non-cooperative graphical game. To implement these methods online, two reinforcement learning frameworks are developed, an actor-disturber-critic structure for the cooperative graphical game and an actor-adversary-disturber-critic structure for the non-cooperative graphical game. The stability of the proposed methods is rigorously analyzed, and simulation results are provided to illustrate the effectiveness of the proposed methods.
Submitted 10 April, 2025;
originally announced April 2025.
-
Assess Space-Based Solar Power in European-Scale Power System Decarbonization
Authors:
Xinyang Che,
Lijun Liu,
Wei He
Abstract:
Meeting net-zero targets remains formidable as terrestrial renewables grapple with intermittency and regional variability. Here, we integrate space-based solar power (SBSP) -- a potential near-constant, orbital solar technology -- into a high-resolution, Europe-wide capacity-expansion and dispatch model to quantify its contribution under net-zero constraints. We examine two advanced SBSP designs: (1) a near-baseload, low Technology Readiness Level (TRL) concept (heliostat-based Representative Design RD1) and (2) a partially intermittent, higher-TRL concept (planar-based RD2), both drawing on NASA's 2050 cost and performance projections. Our results show that RD1 can reduce total system costs by 7--15%, displace up to 80% of intermittent wind and solar, and cut battery usage by over 70%, if it meets its forecast cost reductions -- though long-duration storage (e.g., hydrogen) remains essential for seasonal balancing. By contrast, RD2 is economically unattractive at its projected 2050 costs. Through extensive sensitivity analyses, we identify cost thresholds at which SBSP shifts from cost-prohibitive to complementary and ultimately to a dominant baseload technology. Specifically, RD1 becomes complementary at roughly 14x and dominant at 9x the 2050 solar PV capital cost, benefiting from its continuous power generation. Meanwhile, RD2 must achieve even lower cost levels (9x to be complementary and 6x to dominate) and would rely on short-duration storage to mitigate its partial intermittency. These findings provide quantified techno-economic benchmarks and reveal alternative net-zero pathways, offering critical guidance for policymakers and industry stakeholders seeking large-scale, centrally coordinated renewable solutions with non- or low-intermittency.
Submitted 7 April, 2025;
originally announced April 2025.
-
Decorated phases in triblock copolymers: zeroth- and first-order analysis
Authors:
Stanley Alama,
Lia Bronsard,
Xinyang Lu,
Chong Wang
Abstract:
We study a two-dimensional inhibitory ternary system characterized by a free energy functional which combines an interface short-range interaction energy promoting micro-domain growth with a Coulomb-type long-range interaction energy which prevents micro-domains from unlimited spreading. Here we consider a scenario in which two species are dominant and one species is vanishingly small. In this scenario two energy levels are distinguished: the zeroth-order energy encodes information on the optimal arrangement of the dominant constituents, while the first-order energy gives the shape of the vanishing constituent. This first-order energy also shows that, for any optimal configuration, the vanishing phase must lie on the boundary between the two dominant constituents and form lens clusters also known as vesica piscis.
Submitted 27 March, 2025;
originally announced March 2025.
-
MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments
Authors:
Zhengsheng Guo,
Linwei Zheng,
Xinyang Chen,
Xuefeng Bai,
Kehai Chen,
Min Zhang
Abstract:
While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture of knowledge paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneered the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
Submitted 18 March, 2025;
originally announced March 2025.
-
A counterexample of the Fredholm of Toeplitz operator
Authors:
Hua Liu,
Xinyang Zhang
Abstract:
In this paper we study the essential spectra of the Toeplitz operator on the Hardy space $H^1$. We give a counterexample to show that the Toeplitz operator with symbol is not Fredholm, which refutes the conjecture made by J. A. Virtanen in 2006.
In this paper we study the essential spectra of the Toeplitz operator on the Hardy space $H^1$. We give a counterexample to show that the Toeplitz operator with symbol is not Fredholm, which gives a counterexample to the conjecture by J.A. Virtanen J A in 2006.
Submitted 11 March, 2025;
originally announced March 2025.
-
MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models
Authors:
Han Zhao,
Wenxuan Song,
Donglin Wang,
Xinyang Tong,
Pengxiang Ding,
Xuelian Cheng,
Zongyuan Ge
Abstract:
Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model for quadruped robots, mixture of robotic experts (MoRE), which aims to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparsely activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.
Submitted 10 March, 2025;
originally announced March 2025.
-
AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis
Authors:
Zhangyu Lai,
Yilin Lu,
Xinyang Li,
Jianghang Lin,
Yansong Qu,
Liujuan Cao,
Ming Li,
Rongrong Ji
Abstract:
While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.
Submitted 11 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts
Authors:
Yubin Wang,
Xinyang Jiang,
De Cheng,
Xiangqian Zhao,
Zilong Wang,
Dongsheng Li,
Cairong Zhao
Abstract:
Visual prompt tuning offers significant advantages for adapting pre-trained visual foundation models to specific tasks. However, current research provides limited insight into the interpretability of this approach, which is essential for enhancing AI reliability and enabling AI-driven knowledge discovery. In this paper, rather than learning abstract prompt embeddings, we propose the first framework, named Interpretable Visual Prompt Tuning (IVPT), to explore interpretability for visual prompts, by introducing hierarchical concept prototypes. Specifically, visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes, each corresponding to a specific region of the image. Then, IVPT aggregates features from these regions to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities. Comprehensive qualitative and quantitative evaluations on fine-grained classification benchmarks show its superior interpretability and performance over conventional visual prompt tuning methods and existing interpretable methods.
Submitted 8 March, 2025;
originally announced March 2025.
-
Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
Authors:
Weixiang Zhao,
Xingyu Sui,
Xinyang Han,
Yang Deng,
Yulin Hu,
Jiahe Guo,
Libo Qin,
Qianyun Du,
Shijin Wang,
Yanyan Zhao,
Bing Qin,
Ting Liu
Abstract:
The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.
Submitted 7 March, 2025;
originally announced March 2025.
-
Probing the couplings of an axion-like particle with leptons via three-lepton final state processes at future $e^{-}p$ colliders
Authors:
Chong-Xing Yue,
Xin-Yang Li,
Mei-Shu-Yu Wang,
Yang-Yang Bu
Abstract:
The axion-like particle (ALP) is one of the best motivated particles beyond the Standard Model (SM). We explore the possibility of detecting the couplings of ALP with leptons via three-lepton final state processes $e^- p \to e^- j a~(a \to \ell^+ \ell^-)$ at the LHeC (FCC-eh). For completeness, we investigate the cases where the ALP decays not only into electron and muon pairs but also into tau pairs, and perform a detailed simulation for the decays of taus. We find that the prospective sensitivities of the LHeC (FCC-eh) with the center-of-mass energy $\sqrt{s}=1.3~(3.5)$ TeV and the integrated luminosity $\mathcal{L}=1~ (2)$ ab$^{-1}$ to the ALP-lepton couplings are not only stronger than some existing bounds of the LHC and LEP but also complementary to the expected bounds of the CEPC and FCC-ee.
Submitted 2 April, 2025; v1 submitted 4 March, 2025;
originally announced March 2025.
-
Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
Authors:
Tiansheng Wen,
Yifei Wang,
Zequn Zeng,
Zhong Peng,
Yudi Su,
Xinyang Liu,
Bo Chen,
Hongwei Liu,
Stefanie Jegelka,
Chenyu You
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep
Submitted 19 May, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
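As a rough illustration of the idea in the abstract above (turning a dense embedding into a representation where only a few features stay active), the sketch below keeps the k largest-magnitude entries of a vector and scores similarity over the surviving indices only. This is an assumption about one simple sparsification rule, not the CSR method itself: the helper names `topk_sparsify` and `sparse_dot` are hypothetical, and the learned autoencoder and contrastive objectives mentioned in the abstract are left out.

```python
def topk_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec as an {index: value} dict.

    Hypothetical helper for illustration; not taken from the CSR code base.
    """
    ranked = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return {i: vec[i] for i in ranked[:k]}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(value * b[i] for i, value in a.items() if i in b)


# Toy embeddings; only indices shared by both sparse codes contribute.
query = topk_sparsify([0.1, -2.0, 0.5, 3.0], k=2)
doc = topk_sparsify([1.0, 0.0, 0.2, 2.0], k=2)
score = sparse_dot(query, doc)
```

Smaller k means fewer multiplications per comparison at the price of discarding more of the original vector, which is the kind of cost/fidelity trade-off the abstract describes.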
-
Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization
Authors:
You Shen,
Zhipeng Zhang,
Xinyang Li,
Yansong Qu,
Yu Lin,
Shengchuan Zhang,
Liujuan Cao
Abstract:
Representing 3D scenes from multiview images is a core challenge in computer vision and graphics, which requires both precise rendering and accurate reconstruction. Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Yet, due to the unstructured and irregular nature of Gaussian point clouds, ensuring accurate geometry reconstruction remains difficult. Existing methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. However, the former suffers from inherent conflicts between rendering and reconstruction, while the latter is computationally and storage-intensive. To address these challenges, we propose CarGS, a unified model leveraging Contribution-adaptive regularization to achieve simultaneous, high-quality rendering and surface reconstruction. The essence of our framework is learning adaptive contribution for Gaussian primitives by squeezing the knowledge from geometry regularization into a compact MLP. Additionally, we introduce a geometry-guided densification strategy with clues from both normals and Signed Distance Fields (SDF) to improve the capability of capturing high-frequency details. Our design improves the mutual learning of the two tasks, meanwhile its unified structure does not require separate models as in dual-model based approaches, guaranteeing efficiency. Extensive experiments demonstrate the ability to achieve state-of-the-art (SOTA) results in both rendering fidelity and reconstruction accuracy while maintaining real-time speed and minimal storage size.
Submitted 2 March, 2025;
originally announced March 2025.
-
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Authors:
Weixiang Zhao,
Yulin Hu,
Yang Deng,
Jiahe Guo,
Xingyu Sui,
Xinyang Han,
An Zhang,
Yanyan Zhao,
Bing Qin,
Tat-Seng Chua,
Ting Liu
Abstract:
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
Submitted 27 May, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
END: Early Noise Dropping for Efficient and Effective Context Denoising
Authors:
Hongye Jin,
Pei Chen,
Jingfeng Yang,
Zhengyang Wang,
Meng Jiang,
Yifan Gao,
Binxuan Huang,
Xinyang Zhang,
Zheng Li,
Tianyi Liu,
Huasheng Li,
Bing Yin
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without fine-tuning the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding of the input with the prober, this work also deepens understanding of how LLMs reason over context internally.
Submitted 25 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
Optimal Stochastic Trace Estimation in Generative Modeling
Authors:
Xinyang Liu,
Hengrong Du,
Wei Deng,
Ruqi Zhang
Abstract:
Hutchinson estimators are widely employed in training divergence-based likelihoods for diffusion models to ensure optimal transport (OT) properties. However, this estimator often suffers from high variance and scalability concerns. To address these challenges, we investigate Hutch++, an optimal stochastic trace estimator for generative models, designed to minimize training variance while maintaining transport optimality. Hutch++ is particularly effective for handling ill-conditioned matrices with large condition numbers, which commonly arise when high-dimensional data exhibits a low-dimensional structure. To mitigate the need for frequent and costly QR decompositions, we propose practical schemes that balance frequency and accuracy, backed by theoretical guarantees. Our analysis demonstrates that Hutch++ leads to generations of higher quality. Furthermore, this method exhibits effective variance reduction in various applications, including simulations, conditional time series forecasts, and image generation.
Submitted 25 February, 2025;
originally announced February 2025.
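For readers unfamiliar with the two estimators named in the abstract above, here is a minimal pure-Python sketch of the plain Hutchinson estimator and of the generic Hutch++ recipe (sketch the range of the matrix, take the exact trace on that subspace, and apply Hutchinson only to the remainder). It assumes a small dense matrix stored as a list of rows and uses Gram-Schmidt in place of a library QR routine; the function names are hypothetical, and it does not reproduce the paper's training procedure or its schemes for reducing the frequency of QR decompositions.

```python
import random


def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rademacher(n, rng):
    """Random probe vector with independent +1/-1 entries."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def project_out(basis, v):
    """Remove from v its components along the orthonormal vectors in basis."""
    w = list(v)
    for q in basis:
        c = dot(q, w)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w


def orthonormalize(vectors, tol=1e-10):
    """Modified Gram-Schmidt that drops (near-)dependent vectors."""
    basis = []
    for v in vectors:
        scale = max(1.0, dot(v, v) ** 0.5)
        w = project_out(basis, v)
        norm = dot(w, w) ** 0.5
        if norm > tol * scale:
            basis.append([wi / norm for wi in w])
    return basis


def hutchinson(A, num_probes, rng):
    """Average of g^T A g over random Rademacher probes g."""
    n = len(A)
    total = 0.0
    for _ in range(num_probes):
        g = rademacher(n, rng)
        total += dot(g, matvec(A, g))
    return total / num_probes


def hutchpp(A, k, rng):
    """Hutch++ with k sketch vectors and k residual probes."""
    n = len(A)
    # 1. Sketch the range of A and orthonormalize the sketch.
    Q = orthonormalize([matvec(A, rademacher(n, rng)) for _ in range(k)])
    # 2. Exact trace of A restricted to span(Q).
    captured = sum(dot(q, matvec(A, q)) for q in Q)
    # 3. Hutchinson on the part of A outside span(Q).
    residual = 0.0
    for _ in range(k):
        g = project_out(Q, rademacher(n, rng))
        residual += dot(g, matvec(A, g))
    return captured + residual / k


# Rank-1 test matrix u u^T with trace 1 + 4 + ... + 36 = 91.
u = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
A = [[ui * uj for uj in u] for ui in u]
estimate = hutchpp(A, k=3, rng=random.Random(0))
```

On this rank-1 matrix the sketch already spans the whole range, so the residual term vanishes and `estimate` matches the true trace up to rounding; plain `hutchinson` on the same matrix would still fluctuate from probe to probe, which is the variance gap the abstract refers to.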
-
Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
Authors:
Pengxiang Ding,
Jianfei Ma,
Xinyang Tong,
Binghong Zou,
Xinxin Luo,
Yiguo Fan,
Ting Wang,
Hongchao Lu,
Panzhong Mo,
Jinxin Liu,
Yuefan Wang,
Huaicheng Zhou,
Wenshuo Feng,
Jiacheng Liu,
Siteng Huang,
Donglin Wang
Abstract:
This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudoannotations directly derived from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Built upon whole-body control architectures, extensive experiments show that Humanoid-VLA achieves object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
Submitted 21 February, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution
Authors:
Chendong Wang,
Anlan Zhang,
Yifan Yang,
Lili Qiu,
Yuqing Yang,
Xinyang Jiang,
Feng Qian,
Suman Banerjee
Abstract:
3D volumetric video provides immersive experience and is gaining traction in digital media. Despite its rising popularity, the streaming of volumetric video content poses significant challenges due to the high data bandwidth requirement. A natural approach to mitigate the bandwidth issue is to reduce the volumetric video's data rate by downsampling the content prior to transmission. The video can then be upsampled at the receiver's end using a super-resolution (SR) algorithm to reconstruct the high-resolution details. While super-resolution techniques have been extensively explored and advanced for 2D video content, there is limited work on SR algorithms tailored for volumetric videos.
To address this gap and the growing need for efficient volumetric video streaming, we have developed VoLUT with a new SR algorithm specifically designed for volumetric content. Our algorithm uniquely harnesses the power of lookup tables (LUTs) to facilitate the efficient and accurate upscaling of low-resolution volumetric data. The use of LUTs enables our algorithm to quickly reference precomputed high-resolution values, thereby significantly reducing the computational complexity and time required for upscaling. We further apply an adaptive video bit rate (ABR) algorithm to dynamically determine the downsampling rate according to the network condition and stream the selected video rate to the receiver. Compared to related work, VoLUT is the first to enable high-quality 3D SR on commodity mobile devices at line-rate. Our evaluation shows VoLUT can reduce bandwidth usage by 70%, boost QoE by 36.7% for volumetric video streaming and achieve 3D SR speed-up with no quality compromise.
Submitted 17 February, 2025;
originally announced February 2025.
-
Learning to Predict Global Atrial Fibrillation Dynamics from Sparse Measurements
Authors:
Alexander Jenkins,
Andrea Cini,
Joseph Barker,
Alexander Sharp,
Arunashis Sau,
Varun Valentine,
Srushti Valasang,
Xinyang Li,
Tom Wong,
Timothy Betts,
Danilo Mandic,
Cesare Alippi,
Fu Siong Ng
Abstract:
Catheter ablation of Atrial Fibrillation (AF) consists of a one-size-fits-all treatment with limited success in persistent AF. This may be due to our inability to map the dynamics of AF with the limited resolution and coverage provided by sequential contact mapping catheters, preventing effective patient phenotyping for personalised, targeted ablation. Here we introduce FibMap, a graph recurrent neural network model that reconstructs global AF dynamics from sparse measurements. Trained and validated on 51 non-contact whole atria recordings, FibMap reconstructs whole atria dynamics from 10% surface coverage, achieving a 210% lower mean absolute error and an order of magnitude higher performance in tracking phase singularities compared to baseline methods. Clinical utility of FibMap is demonstrated on real-world contact mapping recordings, achieving reconstruction fidelity comparable to non-contact mapping. FibMap's state-spaces and patient-specific parameters offer insights for electrophenotyping AF. Integrating FibMap into clinical practice could enable personalised AF care and improve outcomes.
Submitted 14 February, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Calibration of Multiple Asynchronous Microphone Arrays using Hybrid TDOA
Authors:
Chengjie Zhang,
Wenda Pan,
Xinyang Han,
He Kong
Abstract:
Accurate calibration of acoustic sensing systems made of multiple asynchronous microphone arrays is essential for satisfactory performance in sound source localization and tracking. State-of-the-art calibration methods for this type of system rely on the time difference of arrival and direction of arrival measurements among the microphone arrays (denoted as TDOA-M and DOA, respectively). In this paper, to enhance calibration accuracy, we propose to incorporate the time difference of arrival measurements between adjacent sound events (TDOA-S) with respect to the microphone arrays. More specifically, we propose a two-stage calibration approach, including an initial value estimation (IVE) procedure and the final joint optimization step. The IVE stage first initializes all parameters except for microphone array orientations, using hybrid TDOA (i.e., TDOA-M and TDOA-S), odometer data from a moving robot carrying a speaker, and DOA. Subsequently, microphone orientations are estimated through the iterative closest point method. The final joint optimization step estimates multiple microphone array locations, orientations, time offsets, clock drift rates, and sound source locations simultaneously. Both simulation and experimental results show that for scenarios with low or moderate TDOA noise levels, our approach outperforms existing methods in terms of accuracy. All code and data are available at https://github.com/AISLABsustech/Hybrid-TDOA-Multi-Calib.
Submitted 10 February, 2025;
originally announced February 2025.
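To make the two kinds of time difference in the abstract above concrete, the following sketch computes an idealized TDOA between two microphones for one sound event and an idealized TDOA between two consecutive sound events at one microphone. It is only a geometric illustration under stated assumptions: a nominal speed of sound of 343 m/s, perfectly synchronized clocks (no time offset or drift, which the paper estimates), and hypothetical function names that do not come from the linked repository.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed nominal value for air


def arrival_time(source, mic, emit_time=0.0, c=SPEED_OF_SOUND):
    """Time at which a sound emitted at emit_time from source reaches mic."""
    return emit_time + math.dist(source, mic) / c


def tdoa_between_mics(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Idealized TDOA of one sound event between two microphones."""
    return arrival_time(source, mic_a, c=c) - arrival_time(source, mic_b, c=c)


def tdoa_between_events(src_1, t_1, src_2, t_2, mic, c=SPEED_OF_SOUND):
    """Idealized TDOA of two consecutive sound events at one microphone."""
    return arrival_time(src_2, mic, t_2, c) - arrival_time(src_1, mic, t_1, c)


# One event at the origin, heard by microphones 5 m and 10 m away.
dt_mics = tdoa_between_mics((0.0, 0.0), (3.0, 4.0), (6.0, 8.0))

# Two events from the same spot, emitted 1 s apart, heard by one microphone.
dt_events = tdoa_between_events((0.0, 0.0), 0.0, (0.0, 0.0), 1.0, (3.0, 4.0))
```

In a real asynchronous setup each array's clock offset and drift would enter these expressions as extra unknowns, which is why the abstract describes estimating them jointly with the geometry.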
-
Swift: Rethinking RDMA Control Plane for Elastic Computing
Authors:
Junxue Zhang,
Han Tian,
Xinyang Huang,
Wenxue Li,
Kaiqiang Xu,
Dian Shen,
Yong Wang,
Kai Chen
Abstract:
Elastic computing enables dynamic scaling to meet workload demands, and Remote Direct Memory Access (RDMA) enhances this by providing high-throughput, low-latency network communication. However, integrating RDMA into elastic computing remains a challenge, particularly in control plane operations for RDMA connection setup.
This paper revisits the assumptions of prior work on high-performance RDMA for elastic computing, and reveals that extreme microsecond-level control plane optimizations are often unnecessary. By challenging the conventional beliefs on the slowness of user-space RDMA control plane and the difficulty of user-space RDMA resource sharing, we uncover new design opportunities. Our key insight is that user-space RDMA connection setup can be significantly improved with caching, while RDMA resources can be efficiently shared among processes using fork. In light of this, we propose Swift, a simple yet effective solution that co-designs RDMA with a serverless framework to optimize performance for elastic computing. At its very core, Swift handles cold and warm serverless requests by swiftly initializing the RDMA control plane with cache-optimized libibverbs, and manages fork requests by leveraging the RDMA's fork capability. Implemented with OpenWhisk, Swift delivers 30.56-46.50% higher average throughput and 18.55-37.21% lower latency, at a cost of 6.5% control plane overhead, compared to prior solutions.
Submitted 31 January, 2025;
originally announced January 2025.
-
Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
Authors:
Yansong Qu,
Dian Chen,
Xinyang Li,
Xiaofan Li,
Shengchuan Zhang,
Liujuan Cao,
Rongrong Ji
Abstract:
Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character's head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://quyans.github.io/Drag-Your-Gaussian.
Submitted 25 May, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Learning-Enhanced Safeguard Control for High-Relative-Degree Systems: Robust Optimization under Disturbances and Faults
Authors:
Xinyang Wang,
Hongwei Zhang,
Shimin Wang,
Wei Xiao,
Martin Guay
Abstract:
Merely pursuing performance may compromise safety, while a conservative policy for safe exploration degrades performance. Balancing safety and performance in learning-based control is an interesting yet challenging problem. This paper aims to enhance system performance with a safety guarantee when solving reinforcement learning (RL)-based optimal control problems for nonlinear systems subject to high-relative-degree state constraints and unknown time-varying disturbances and actuator faults. First, to combine control barrier functions (CBFs) with RL, a new type of CBF, termed the high-order reciprocal control barrier function (HO-RCBF), is proposed to handle high-relative-degree constraints during the learning process. Then, the concept of gradient similarity is proposed to quantify the relationship between the gradient of safety and the gradient of performance. Finally, gradient manipulation and adaptive mechanisms are introduced in the safe RL framework to enhance performance with a safety guarantee. Two simulation examples illustrate that the proposed safe RL framework can address high-relative-degree constraints, enhance safety robustness, and improve system performance.
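As a rough illustration of the gradient-similarity idea mentioned in this abstract, the sketch below measures the agreement between a performance gradient and a safety gradient and removes the conflicting component when they disagree. The abstract does not give the paper's definitions, so everything here is an assumption: cosine similarity stands in for "gradient similarity", a projection step stands in for "gradient manipulation", and the function names `gradient_similarity` and `manipulate_gradient` are hypothetical.

```python
import numpy as np


def gradient_similarity(g_perf, g_safe, eps=1e-12):
    """Cosine similarity between two gradients (assumed measure, not the paper's)."""
    denom = np.linalg.norm(g_perf) * np.linalg.norm(g_safe)
    return float(np.dot(g_perf, g_safe) / (denom + eps))


def manipulate_gradient(g_perf, g_safe, eps=1e-12):
    """Return a performance gradient that does not oppose the safety gradient.

    If the two gradients conflict (negative inner product), the component of
    g_perf along g_safe is projected out; otherwise g_perf is returned as is.
    This projection rule is an illustrative stand-in, not the paper's method.
    """
    dot = np.dot(g_perf, g_safe)
    if dot >= 0:
        return g_perf
    return g_perf - dot / (np.dot(g_safe, g_safe) + eps) * g_safe


# Hypothetical gradients: improving performance here would reduce safety.
g_perf = np.array([1.0, -1.0])
g_safe = np.array([0.0, 1.0])

similarity = gradient_similarity(g_perf, g_safe)   # negative: the goals conflict
g_adjusted = manipulate_gradient(g_perf, g_safe)   # conflict component removed
```

With these example vectors the similarity is negative, so the adjusted gradient keeps only the part of the performance update that is orthogonal to the safety gradient.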
Submitted 25 January, 2025;
originally announced January 2025.
-
SE(3)-Based Trajectory Optimization and Target Tracking in UAV-Enabled ISAC Systems
Authors:
Dongxiao Xu,
Xinyang Li,
Vlad C. Andrei,
Moritz Wiese,
Ullrich J. Moenich,
Holger Boche
Abstract:
This paper presents a novel approach to enhance sensing capabilities in UAV-enabled MIMO-OFDM ISAC systems by leveraging the mobility of a UAV operating as a monostatic radar. By integrating uniform planar arrays (UPAs) and modeling the UAV dynamics in $SE(3)$, we address key challenges such as 3D space sensing and trajectory design. We propose a target tracking scheme using extended Kalman filtering (EKF) in $SE(3)$, along with trajectory optimization based on the conditional posterior Cramér-Rao bound (CPCRB). Numerical results demonstrate the effectiveness of the proposed trajectory design in enhancing the performance of target tracking and physical parameter estimation in UAV-enabled MIMO-OFDM ISAC systems.
Submitted 29 April, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.