Search | arXiv e-print repository

Wormhole Solutions and pre-inflationary in $F(R, T)$ Gravity with Axion Fields

Authors: Guo-He Li, Yeqi Fang, Yuchi Wu, Jun Tao

Abstract: In this study, we investigate wormhole solutions and inflationary initial conditions in the coupled axion-inflaton/scalar field system within $F(R,T)$ gravity. Notably, the Euclidean action is reduced by approximately $10^5$ compared with the GR case. In Euclidean AdS spacetime, we construct Euclidean (semi-)wormhole geometries that naturally set inflationary initial conditions. To enhance the probability of sustained inflation and address the short inflation duration issue in the no-boundary proposal, we constrain $\lambda$ to $\lambda < -\kappa$ by analyzing the properties of the Euclidean action. This constraint causes the probability weight $P \propto e^{-S_E}$ to favor high-potential regions, achieving sufficient inflation while maintaining a nearly Gaussian perturbation spectrum. The results suggest that matter-geometry coupling in $F(R,T)$ gravity provides a novel mechanism for reconciling quantum cosmology with observational requirements.

Submitted 10 June, 2025; originally announced June 2025.

Comments: 15 pages, 9 figures

arXiv:2506.07935 [pdf, ps, other]

Diffusion of Responsibility in Collective Decision Making

Authors: Pavel Naumov, Jia Tao

Abstract: The term "diffusion of responsibility" refers to situations in which multiple agents share responsibility for an outcome, obscuring individual accountability. This paper examines this frequently undesirable phenomenon in the context of collective decision-making mechanisms. The work shows that if a decision is made by two agents, then the only way to avoid diffusion of responsibility is for one agent to act as a "dictator", making the decision unilaterally. In scenarios with more than two agents, any diffusion-free mechanism is an "elected dictatorship" where the agents elect a single agent to make a unilateral decision. The technical results are obtained by defining a bisimulation of decision-making mechanisms, proving that bisimulation preserves responsibility-related properties, and establishing the results for a smallest bisimilar mechanism.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.03880 [pdf, ps, other]

RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao

Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.

Submitted 4 June, 2025; originally announced June 2025.
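
The RadialRouter abstract mentions an objective that combines Kullback-Leibler divergence with a contrastive loss. As a minimal illustration of the KL term only (not the authors' implementation), the sketch below computes KL(p || q) between two discrete distributions. The function name, the three-LLM pool, and the probability values are all assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target and predicted routing distributions over three LLMs.
target = [0.7, 0.2, 0.1]
predicted = [0.5, 0.3, 0.2]
print(kl_divergence(target, predicted))
```

The divergence is zero when the two distributions coincide and positive otherwise, which is why it can serve as a training signal for matching a predicted routing distribution to a target one.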

arXiv:2506.02931 [pdf, ps, other]

ThinkTank: A Framework for Generalizing Domain-Specific AI Agent Systems into Universal Collaborative Intelligence Platforms

Authors: Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao

Abstract: This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem-solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration methodologies. Through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval-Augmented Generation with advanced knowledge storage, the framework facilitates expertise creation and robust knowledge sharing. ThinkTank enables organizations to leverage collaborative AI for knowledge-intensive tasks while ensuring data privacy and security through local deployment, utilizing frameworks like Ollama with models such as Llama3.1. The ThinkTank framework is designed to deliver significant advantages in cost-effectiveness, data security, scalability, and competitive positioning compared to cloud-based alternatives, establishing it as a universal platform for AI-driven collaborative problem-solving. The ThinkTank code is available at https://github.com/taugroup/ThinkTank

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02707 [pdf]

Unit Commitment with Cost-Oriented Temporal Resolution

Authors: Junyi Tao, Ran Li, Salvador Pineda

Abstract: Time-adaptive unit commitment (UC) has recently been investigated to reduce the scheduling costs by flexibly varying the temporal resolution, which is usually determined by clustering the net load patterns. However, there exists a misalignment between cost and net load patterns due to the discrete start-up costs and out-of-merit-order dispatch triggered by ramping and other constraints. The optimal time-adaptive resolution cannot be completely captured by clustering-based methods. This paper proposes a cost-oriented method to address this misalignment by a novel bilevel optimization approach that is efficiently solved through a heuristic greedy algorithm. The impact of varying temporal resolution on the final scheduling costs is tested, based on which the temporal resolution is heuristically updated, achieving significant cost reduction without increasing the number of temporal periods. Subsequently, an improved discretized Adam optimization method together with offline warm start and online refinement strategy is proposed to efficiently search for a better temporal resolution configuration. Results show that the proposed cost-oriented UC temporal resolution determination method achieves enhanced cost efficiency.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.00578 [pdf, other]

doi 10.1007/s10409-025-25314-x

Event-based multi-view photogrammetry for high-dynamic, high-velocity target measurement

Authors: Taihang Lei, Banglei Guan, Minzu Liang, Xiangyu Li, Jianbing Liu, Jing Tao, Yang Shang, Qifeng Yu

Abstract: The characterization of mechanical properties for high-dynamic, high-velocity target motion is essential in industry. It provides crucial data for validating weapon systems, precision manufacturing processes, and similar applications. However, existing measurement methods face challenges such as limited dynamic range, discontinuous observations, and high costs. This paper presents a new approach leveraging an event-based multi-view photogrammetric system, which aims to address the aforementioned challenges. First, the monotonicity in the spatiotemporal distribution of events is leveraged to extract the target's leading-edge features, eliminating the tailing effect that complicates motion measurements. Then, reprojection error is used to associate events with the target's trajectory, providing more data than traditional intersection methods. Finally, a target velocity decay model is employed to fit the data, enabling accurate motion measurements via our joint multi-view data computation. In a light gas gun fragment test, the proposed method showed a measurement deviation of 4.47% compared to the electromagnetic speedometer.

Submitted 31 May, 2025; originally announced June 2025.

Comments: 9 pages, 9 figures, 1 table. This paper was accepted by Acta Mechanica Sinica (30 May 2025)

arXiv:2506.00375 [pdf, ps, other]

RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection

Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li

Abstract: Existing methods for deepfake audio detection have demonstrated some effectiveness. However, they still face challenges in generalizing to new forgery techniques and evolving attack patterns. This limitation mainly arises because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA-ADD, a robust audio deepfake detection framework driven by forgery trace enhancement and based on integrated Reconstruction-Perception-Reinforcement-Attention networks. First, we propose a Global-Local Forgery Perception (GLFP) module for enhancing the acoustic perception capacity of forgery traces. To significantly reinforce the feature space distribution differences between real and fake audio, the Multi-stage Dispersed Enhancement Loss (MDEL) is designed, which implements a dispersal strategy in multi-stage feature spaces. Furthermore, in order to enhance feature awareness towards forgery traces, the Fake Trace Focused Attention (FTFA) mechanism is introduced to adjust attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate that FTFA not only improves attention to voice segments but also enhances generalization capability. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on 4 benchmark datasets, including ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound, achieving over 20% performance improvement. In addition, it outperforms existing methods in rigorous 3×3 cross-domain evaluations across Speech, Sound, and Singing, demonstrating strong generalization capability across diverse audio domains.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.19287 [pdf, ps, other]

svc: An R package for Spatially Varying Coefficient Models

Authors: Justice Akuoko-Frimpong, Edward Shao, Jonathan Ta

Abstract: Traditional regression models assume stationary relationships between predictors and responses, failing to capture the spatial heterogeneity present in many environmental, epidemiological, and ecological processes. To address this limitation, we develop a scalable Bayesian framework for spatially varying coefficient (SVC) models, implemented in the svc R package (available at https://github.com/jdta95/svc), which allows regression coefficients to vary smoothly over space. Our approach combines three key computational innovations: (1) a subset Gaussian process approximation that reduces the computational burden from $O(n^3)$ to $O(m^3)$ with $m<n$, while maintaining predictive accuracy; (2) a robust adaptive Metropolis (RAM) algorithm that automatically tunes proposal distributions for efficient MCMC sampling of spatial range parameters; and (3) optimized linear algebra operations leveraging precomputed distance matrices and Cholesky decompositions to accelerate covariance calculations. We present the model's theoretical foundation, prior specification, and Gibbs sampling algorithm, with a focus on practical implementation for large spatial datasets. Simulation studies demonstrate that our method outperforms existing approaches in computational efficiency while maintaining competitive estimation accuracy. We illustrate its application in an analysis of land surface temperature (LST) data, revealing spatially varying effects of vegetation and emissivity that would be obscured by traditional regression techniques. The svc package provides researchers with a flexible, efficient tool for uncovering and quantifying nonstationary spatial relationships across diverse scientific domains.

Submitted 25 May, 2025; originally announced May 2025.
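
To give a sense of scale for the $O(n^3)$ to $O(m^3)$ reduction quoted in the svc abstract, the sketch below compares the two cubic operation counts while ignoring constant factors. The function name and the sizes (10,000 observations, a 500-point subset) are hypothetical and are not taken from the paper.

```python
def cubic_cost_ratio(n, m):
    """Ratio of O(n^3) to O(m^3) operation counts, ignoring constant factors."""
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    return (n / m) ** 3

# Hypothetical sizes: 10,000 observations and a 500-point subset.
print(cubic_cost_ratio(10_000, 500))  # (20)^3 = 8000.0
```

Because the cost is cubic, shrinking the problem by a factor of 20 cuts the dominant cost by a factor of 8,000 under this simplified count.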

arXiv:2505.18232 [pdf, ps, other]

ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning

Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao

Abstract: The deployment of large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, it yields clear end-to-end acceleration, making it a promising technique for efficient LLMs.

Submitted 23 May, 2025; originally announced May 2025.
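
The ELDeR abstract describes regularizing the difference between the output and input of the layers with smaller learned weights. The sketch below is one possible reading of that step, not the authors' code: it selects the lowest-weight layers and sums a squared difference over them. The squared-L2 form of the penalty, the function name, and all toy values are assumptions.

```python
def low_weight_penalty(layer_inputs, layer_outputs, layer_weights, num_to_prune):
    """Return the lowest-weight layer indices and their summed squared
    output-input difference (an assumed squared-L2 penalty)."""
    # Indices of the layers ordered from smallest to largest learned weight.
    order = sorted(range(len(layer_weights)), key=lambda i: layer_weights[i])
    selected = order[:num_to_prune]
    penalty = 0.0
    for i in selected:
        penalty += sum((o - x) ** 2
                       for x, o in zip(layer_inputs[i], layer_outputs[i]))
    return selected, penalty

# Toy hidden states for three layers (hypothetical values).
inputs = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
outputs = [[1.5, 2.5], [1.5, 2.5], [2.0, 4.0]]
weights = [0.9, 0.1, 0.4]
print(low_weight_penalty(inputs, outputs, weights, num_to_prune=2))
```

Driving this penalty toward zero makes the selected layers behave like identity maps, so removing them afterwards should change the model's output less than pruning them directly.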

arXiv:2505.15692 [pdf, other]

Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Authors: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao

Abstract: Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.

Submitted 26 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.15210 [pdf, other]

Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

Authors: Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su

Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these observations, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Under Review

ACM Class: I.2.4

arXiv:2505.14135 [pdf, other]

Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, et al. (33 additional authors not shown)

Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.

Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.11770 [pdf, ps, other]

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks, including symbol manipulation, knowledge retrieval, and instruction following, we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

Submitted 16 May, 2025; originally announced May 2025.

Comments: ICML 2025

arXiv:2505.11733 [pdf, ps, other]

MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Authors: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou

Abstract: Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.

Submitted 20 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.
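
The MedCaseReasoning abstract reports two kinds of figures: reasoning recall (the share of clinician reasoning statements a model mentions) and relative gain. The sketch below shows how such quantities are commonly computed. It uses exact string matching as a stand-in for whatever matching procedure the paper actually uses, and the function names and toy data are hypothetical.

```python
def reasoning_recall(clinician_statements, mentioned):
    """Fraction of clinician reasoning statements that the model mentioned."""
    if not clinician_statements:
        raise ValueError("need at least one clinician statement")
    hits = sum(1 for s in clinician_statements if s in mentioned)
    return hits / len(clinician_statements)

def relative_gain(before, after):
    """Relative improvement of `after` over a nonzero baseline `before`."""
    if before == 0:
        raise ValueError("baseline must be nonzero")
    return (after - before) / before

# Toy case: the model mentions three of four clinician statements.
statements = ["fever", "rash", "travel history", "elevated CRP"]
model_mentions = {"fever", "rash", "elevated CRP"}
print(reasoning_recall(statements, model_mentions))
print(relative_gain(0.5, 0.75))
```

A relative gain is measured against the baseline value, so moving from 0.5 to 0.75 is a 50% relative gain even though the absolute change is 25 percentage points.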

arXiv:2505.11079 [pdf, ps, other]

$\mathcal{A}LLM4ADD$: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

Abstract: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness in detecting fake audio. To enhance their performance, we propose $\mathcal{A}LLM4ADD$, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?". We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems.

Submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.11044 [pdf, ps, other]

Exploration by Random Distribution Distillation

Authors: Zhirui Fang, Kai Yang, Jian Tao, Jiafei Lyu, Lusong Li, Li Shen, Xiu Li

Abstract: Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Currently, the main exploration algorithms are primarily count-based methods and curiosity-based methods, with prediction-error methods being a prominent example. In this paper, we propose a novel method called Random Distribution Distillation (RDD), which samples the output of a target network from a normal distribution. RDD facilitates a more extensive exploration by explicitly treating the difference between the prediction network and the target network as an intrinsic reward. Furthermore, by introducing randomness into the output of the target network for a given state and modeling it as a sample from a normal distribution, intrinsic rewards are bounded by two key components: a pseudo-count term ensuring proper exploration decay and a discrepancy term accounting for predictor convergence. We demonstrate that RDD effectively unifies both count-based and prediction-error approaches. It retains the advantages of prediction-error methods in high-dimensional spaces, while also implementing an intrinsic reward decay mode akin to the pseudo-count method. In the experimental section, RDD is compared with more advanced methods in a series of environments. Both theoretical analysis and experimental results confirm the effectiveness of our approach in improving online exploration for reinforcement learning tasks.

Submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.09592 [pdf, ps, other]

Quantum-State-Controlled Collisions of Ultracold Polyatomic Molecules

Authors: Nathaniel B. Vilas, Paige Robichaud, Christian Hallas, Junheng Tao, Loïc Anderegg, Grace K. Li, Hana Lampson, Lucie D. Augustovičová, John L. Bohn, John M. Doyle

Abstract: Collisions between ultracold calcium monohydroxide (CaOH) molecules are realized and studied. Inelastic collision rate constants are measured for CaOH prepared in ground and excited vibrational states, and the electric field dependence of these rates is measured for molecules in single quantum states of the parity-doubled bending mode. Theoretical calculations of collision rate coefficients are performed and found to agree with measured values. The lowest collisional loss rates are for states with repulsive long-range potentials that shield ultracold molecules from loss channels at short distance. These results unveil the collisional behavior of parity doublet molecules in the ultracold regime, and lay the foundation for future experiments to evaporatively cool polyatomic molecules to quantum degeneracy.

Submitted 14 May, 2025; originally announced May 2025.

Comments: 22 pages, 10 figures

arXiv:2505.06312 [pdf, other]

Responsibility Gap in Collective Decision Making

Authors: Pavel Naumov, Jia Tao

Abstract: The responsibility gap is a set of outcomes of a collective decision-making mechanism in which no single agent is individually responsible. In general, when designing a decision-making process, it is desirable to minimise the gap. The paper proposes a concept of an elected dictatorship. It shows that, in a perfect information setting, the gap is empty if and only if the mechanism is an elected dictatorship. It also proves that in an imperfect information setting, the class of gap-free mechanisms is positioned strictly between two variations of the class of elected dictatorships.

Submitted 8 May, 2025; originally announced May 2025.

Comments: full version of an IJCAI-25 paper

arXiv:2504.19423 [pdf, other]

MER 2025: When Affective Computing Meets Large Language Models

Authors: Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, Ziyang Ma, Xiaojiang Peng, Xie Chen, Ya Li, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao

Abstract: MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER2025 centers on the theme "When Affective Computing Meets Large Language Models (LLMs)". We aim to shift the paradigm from traditional categorical frameworks reliant on predefined emotion taxonomies to LLM-driven generative methods, offering innovative solutions for more accurate and reliable emotion understanding. The challenge features four tracks: MER-SEMI focuses on fixed categorical emotion recognition enhanced by semi-supervised learning; MER-FG explores fine-grained emotions, expanding recognition from basic to nuanced emotional states; MER-DES incorporates multimodal cues (beyond emotion words) into predictions to enhance model interpretability; MER-PR investigates whether emotion prediction results can improve personality recognition performance. For the first three tracks, baseline code is available at MERTools, and datasets can be accessed via Hugging Face. For the last track, the dataset and baseline code are available on GitHub.

Submitted 29 April, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

arXiv:2504.14465 [pdf, ps, other]

The Onset of Metastable Turbulence in Pipe Flow

Authors: Jiashun Guan, Jianjun Tao

Abstract: The onset of turbulence in pipe flow has been a fundamental challenge in physics, applied mathematics, and engineering for over 140 years. To date, the precursor of this laminar-turbulent transition is recognized as transient turbulent spots or puffs, but their defining characteristics - longevity, abrupt relaminarization, and super-exponential lifetime scaling - have lacked first-principles explanations. By combining extensive computer simulations, theory, and verifications with experimental data, we identify distinct puff relaminarizations separated by a critical Reynolds number, which are defined by a noisy saddle-node bifurcation derived from the Navier-Stokes equations. Below the critical number, the mean lifetime of a puff follows a square-root scaling law, representing an intrinsically deterministic decay dominated by the critical slowing down. Above the critical value, the bifurcation's node branch creates a potential well stabilizing the turbulence, while the saddle branch mediates stochastic barrier-crossing events that drive memoryless decay - a hallmark of metastable states. Accordingly, the mean lifetimes are solved theoretically and can be fitted super-exponentially. By quantifying the deterministic and stochastic components in the kinetic energy equation, the lifetime statistics of puffs are analyzed in a unified framework across low-to-moderate Reynolds number regimes, uncovering the mechanisms governing the transition to metastable turbulence in pipe flows.

Submitted 9 June, 2025; v1 submitted 19 April, 2025; originally announced April 2025.

Comments: 22 pages, 7 figures

arXiv:2504.12395 [pdf, other]

InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

Authors: Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu

Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.

Submitted 16 April, 2025; originally announced April 2025.

Comments: Tech Report. Code is available at https://github.com/Tencent/InstantCharacter

arXiv:2504.08614 [pdf, other]

Imaginary gauge potentials in a non-Hermitian spin-orbit coupled quantum gas

Authors: Junheng Tao, Emmanuel Mercado-Gutierrez, Mingshu Zhao, Ian Spielman

Abstract: In 1996, Hatano and Nelson proposed a non-Hermitian lattice model containing an imaginary Peierls phase [Phys. Rev. Lett. 77 570-573 (1996)], which subsequent analyses revealed to be an instance of a new class of topological systems. Here, we experimentally realize a continuum analog to this model containing an imaginary gauge potential using a homogeneous spin-orbit coupled Bose-Einstein condensate (BEC). Non-Hermiticity is introduced by adding tunable spin-dependent loss via microwave coupling to a subspace with spontaneous emission. We demonstrate that the resulting Heisenberg equations of motion for position and momentum depend explicitly on the system's phase-space distribution. First, we observe collective nonreciprocal transport in real space, with a "self-acceleration" that decreases with the BEC's spatial extent, consistent with non-Hermitian Gross-Pitaevskii simulations. We then examine localized edge states: the relatively strong interactions in our BEC suppress the formation of topological edge states, yielding instead highly excited states localized by an interplay between self-acceleration and wavefunction spreading. Finally, we confirm that our non-Hermitian description remains valid at all times by comparing to a multi-level master-equation treatment.

Submitted 11 April, 2025; originally announced April 2025.

arXiv:2504.05197 [pdf, other]

P2Mark: Plug-and-play Parameter-level Watermarking for Neural Speech Generation

Authors: Yong Ren, Jiangyan Yi, Tao Wang, Jianhua Tao, Zheng Lian, Zhengqi Wen, Chenxing Li, Ruibo Fu, Ye Bai, Xiaohui Zhang

Abstract: Neural speech generation (NSG) has rapidly advanced as a key component of artificial intelligence-generated content, enabling the generation of high-quality, highly realistic speech for diverse applications. This development increases the risk of technique misuse and threatens social security. Audio watermarking can embed imperceptible marks into generated audio, providing a promising approach for secure NSG usage. However, current audio watermarking methods are mainly applied at the audio level or feature level, which is not suitable for open-source scenarios where source code and model weights are released. To address this limitation, we propose a Plug-and-play Parameter-level WaterMarking (P2Mark) method for NSG. Specifically, we embed watermarks into the released model weights, offering a reliable solution for proactively tracing and protecting model copyrights in open-source scenarios. During training, we introduce a lightweight watermark adapter into the pre-trained model, allowing watermark information to be merged into the model via this adapter. This design ensures both the flexibility to modify the watermark before model release and the security of embedding the watermark within model parameters after model release. Meanwhile, we propose a gradient orthogonal projection optimization strategy to ensure the quality of the generated audio and the accuracy of watermark preservation. Experimental results on two mainstream waveform decoders in NSG (i.e., vocoder and codec) demonstrate that P2Mark achieves comparable performance to state-of-the-art audio watermarking methods that are not applicable to open-source white-box protection scenarios, in terms of watermark extraction accuracy, watermark imperceptibility, and robustness.

Submitted 5 May, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.00487 [pdf, other]

FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

Authors: Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su

Abstract: Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.

Submitted 2 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

Comments: Under Review

ACM Class: H.5.1; I.2.4

arXiv:2503.22724 [pdf, other]

A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation

Authors: Haonan Shi, Long Tian, Jie Tao, Yufei Li, Liming Wang, Xiyang Liu

Abstract: Hail is a considerable contributor to meteorological disasters, and there is a great need to mitigate its socioeconomic effects through precise forecasts that have high resolution, long lead times, and local details over large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events, such as heavy rainfall, and are unsuitable for hail nowcasting, since hail is mainly caused by low-altitude local strong convection associated with terrain. Additionally, radar captures the status of low cloud layers, such as water vapor, droplets, and ice crystals, providing rich signals suitable for hail nowcasting. To this end, we introduce a Spatial-Temporal gEnerAtive Model called SteamCast for hail nowcasting with radar echo extrapolation. It is a deep probabilistic diffusion model based on spatial-temporal representations including radar echoes as well as their position/time embeddings, which we trained on a historical reanalysis archive from the Yan'an Meteorological Bureau in China, where crops such as apples suffer greatly from hail damage. Considering the short-term nature of hail, SteamCast provides 30-minute nowcasts at 6-minute intervals for a single radar reflectivity variable, across 9 different vertical angles, on a latitude-longitude grid with approximately 1 km × 1 km resolution per pixel in Yan'an City, China. By successfully fusing the spatial-temporal features of radar echoes, SteamCast delivers competitive, and in some cases superior, results compared to other deep learning-based models such as PredRNN and VMRNN.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.18246 [pdf, other]

ZECO: ZeroFusion Guided 3D MRI Conditional Generation

Authors: Feiran Wang, Bin Duan, Jiachen Tao, Nikhil Sharma, Dawen Cai, Yan Yan

Abstract: Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. To effectively capture inter-slice relationships within volumes, we introduce a Spatial Transformation Module that encodes MRI images into a compact latent space for the diffusion process. Moving beyond unconditional generation, our novel ZeroFusion method progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on Brain MRI datasets across various modalities, showcasing its exceptional capability in synthesizing high-quality MRI images conditioned on segmentation masks.

Submitted 23 March, 2025; originally announced March 2025.

Comments: Project page: https://brack-wang.github.io/ZECO_web/; GitHub code: https://github.com/Brack-Wang/ZECO

arXiv:2503.14359 [pdf, other]

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Authors: Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

Abstract: User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

Submitted 18 March, 2025; originally announced March 2025.

Comments: Accepted by CVPR 2025

arXiv:2503.13991 [pdf, other]

GraphTEN: Graph Enhanced Texture Encoding Network

Authors: Bo Peng, Jintao Chen, Mufeng Yao, Chenhao Zhang, Jianghui Zhang, Mingmin Chi, Jiang Tao

Abstract: Texture recognition is a fundamental problem in computer vision and pattern recognition. Recent progress leverages feature aggregation into discriminative descriptions based on convolutional neural networks (CNNs). However, modeling non-local context relations through visual primitives remains challenging due to the variability and randomness of texture primitives in spatial distributions. In this paper, we propose a graph-enhanced texture encoding network (GraphTEN) designed to capture both local and global features of texture primitives. GraphTEN models global associations through fully connected graphs and captures cross-scale dependencies of texture primitives via bipartite graphs. Additionally, we introduce a patch encoding module that utilizes a codebook to achieve an orderless representation of texture by encoding multi-scale patch features into a unified feature space. The proposed GraphTEN achieves superior performance compared to state-of-the-art methods across five publicly available datasets.

Submitted 18 March, 2025; originally announced March 2025.

Comments: 6 pages, 7 figures, conference paper

MSC Class: 68T45

ACM Class: I.2.10; I.4.7

arXiv:2503.09962 [pdf, other]

Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Authors: Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu

Abstract: Text-to-image person re-identification (ReID) aims to retrieve the images of a person of interest based on textual descriptions. One main challenge for this task is the high cost of manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.

Submitted 12 March, 2025; originally announced March 2025.

Comments: CVPR 2025. Project website: https://github.com/sssaury/HAM

arXiv:2503.08596 [pdf, other]

X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction

Authors: Feiran Wang, Jiachen Tao, Junyi Wu, Haoxuan Wang, Bin Duan, Kai Wang, Zongxin Yang, Yan Yan

Abstract: X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to potential health risks. To mitigate radiation exposure, recent research focuses on generating novel views from sparse inputs and reconstructing Computed Tomography (CT) volumes, borrowing representations from the 3D reconstruction area. However, these representations originally target visible light imaging that emphasizes reflection and scattering effects, while neglecting penetration and attenuation properties of X-ray imaging. In this paper, we introduce X-Field, the first 3D representation specifically designed for X-ray imaging, rooted in the energy absorption rates across different materials. To accurately model diverse materials within internal structures, we employ 3D ellipsoids with distinct attenuation coefficients. To estimate each material's energy absorption of X-rays, we devise an efficient path partitioning algorithm accounting for complex ellipsoid intersections. We further propose hybrid progressive initialization to refine the geometric accuracy of X-Field and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray Novel View Synthesis and CT Reconstruction.

Submitted 11 March, 2025; originally announced March 2025.

Comments: Project page: https://brack-wang.github.io/XField/; GitHub code: https://github.com/Brack-Wang/X-Field

arXiv:2503.08131 [pdf, ps, other]

Large Scale Multi-Task Bayesian Optimization with Large Language Models

Authors: Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Ryan Marcus, Osbert Bastani, Jacob R. Gardner

Abstract: In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling beyond a moderate number of tasks. We introduce a novel approach leveraging large language models (LLMs) to learn from, and improve upon, previous optimization trajectories, scaling to approximately 1500 distinct tasks. Specifically, we propose a feedback loop in which an LLM is fine-tuned on the high-quality solutions to specific tasks found by Bayesian optimization (BO). This LLM is then used to generate initialization points for future BO searches for new tasks. The trajectories of these new searches provide additional training data for fine-tuning the LLM, completing the loop. We evaluate our method on two distinct domains: database query optimization and antimicrobial peptide design. Results demonstrate that our approach creates a positive feedback loop, where the LLM's generated initializations gradually improve, leading to better optimization performance. As this feedback loop continues, we find that the LLM is eventually able to generate solutions to new tasks in just a few shots that are better than the solutions produced "from scratch" by Bayesian optimization while simultaneously requiring significantly fewer oracle calls.

Submitted 12 June, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

arXiv:2503.07886 [pdf, other]

Experimental Study on the Rotation-induced Reduction of Penetration Resistance in Sand

Authors: Yong Tang, Yi Zhong, Julian Tao

Abstract: Soil-dwelling organisms have evolved diverse strategies for efficient subterranean movement. For example, the seeds of Erodium cicutarium and Pelargonium species employ continuous rotational motion for self-burial, while the angled worm lizard Agamodon angeliceps tunnels by oscillating its head around its trunk's axis. These rotational movements significantly reduce penetration resistance. This study presents comprehensive experiments investigating the effects of various factors on rotational penetration forces and energy consumption. Results reveal that force reduction follows an approximately hyperbolic decay with the tangential-to-axial velocity ratio ($u$). Penetrator geometry, particularly roundness and conical tip shape, is found to significantly influence reduction at low velocity ratios, whereas relative density and material type exhibit moderate impact. Reduction is also observed to increase with interfacial friction angle but decrease with confining pressure and depth. Energy consumption analysis shows that while penetration force-related energy decreases with $u$, total energy consumption increases due to rotational torque. For self-burrowing robot designs, lower velocity ratios are recommended to balance penetration force reduction and energy efficiency effectively.

Submitted 6 April, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

Comments: 17 pages, 18 figures

arXiv:2502.18549 [pdf, other]

Target Defense with Multiple Defenders and an Agile Attacker via Residual Policy Learning

Authors: Jiyue Tao, Tongsheng Shen, Dexin Zhao, Feitian Zhang

Abstract: The target defense problem involves intercepting an attacker before it reaches a designated target region using one or more defenders. This letter focuses on a particularly challenging scenario in which the attacker is more agile than the defenders, significantly increasing the difficulty of effective interception. To address this challenge, we propose a novel residual policy framework that integrates deep reinforcement learning (DRL) with the force-based Boids model. In this framework, the Boids model serves as a baseline policy, while DRL learns a residual policy to refine and optimize the defenders' actions. Simulation experiments demonstrate that the proposed method consistently outperforms traditional interception policies, whether learned via vanilla DRL or fine-tuned from force-based methods. Moreover, the learned policy exhibits strong scalability and adaptability, effectively handling scenarios with varying numbers of defenders and attackers with different agility levels.

Submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.17475 [pdf, other]

ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis

Authors: Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, Jie Tao

Abstract: We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA

Submitted 7 April, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

arXiv:2502.13917 [pdf, ps, other]

TESS 2: A Large-Scale Generalist Diffusion Language Model

Authors: Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan

Abstract: We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at https://github.com/hamishivi/tess-2.

Submitted 31 May, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

Comments: ACL 2025 camera-ready

arXiv:2502.11342 [pdf, other]

Revisiting the charge-density-wave superlattice of 1$T$-TiSe$_2$

Authors: Wei Wang, Patrick Liu, Lijun Wu, Jing Tao, Genda Gu, Alfred Zong, Yimei Zhu

Abstract: A number of intriguing phenomena, including exciton condensation, orbital ordering, and emergence of chirality, have been proposed to accompany charge-density-wave (CDW) formation in the layered transition metal dichalcogenide 1$T$-TiSe$_2$. Explaining these effects relies on knowledge of the atomic displacement pattern underlying the CDW, yet structural proposals based on spatially-averaging bulk crystal diffraction and surface-dependent scanning tunneling microscopy have remained inconsistent. Here, we revisit the CDW superlattice structure with selected-area electron diffraction, a bulk-sensitive probe capable of capturing sub-micrometer spatial variations while maintaining high momentum resolution. We resolved two distinct, spatially separated CDW phases characterized by different interlayer ordering. In both phases, previously reported atomic displacement patterns fail to account for the observed extinction rules. Instead, our analysis reveals a new superlattice structure, which features a large number of nearly degenerate CDW domains. These findings not only provide a new basis for understanding the gyrotropic electronic order and metastability in 1$T$-TiSe$_2$, they also underscore the importance of bulk-sensitive mesoscopic techniques in investigating materials that host unconventional superlattices.

Submitted 16 February, 2025; originally announced February 2025.

arXiv:2502.05809 [pdf]

Achieving electrode smoothing by controlling the nucleation phase of metal deposition through polymer-substrate binding

Authors: Ying Xia, Duo Song, Mingyi Zhang, Zheming Wang, Chenyang Shi, Jingshan S. Du, Sun Hae Ra Shin, Mark H. Engelhard, Praveen K. Thallapally, Christine A. Orme, Jinhui Tao, Maria L. Sushko, James. J. De Yoreo, Jun Liu

Abstract: Polymer additives [like polyethylene oxide (PEO)] are widely used for smooth electrode deposition in aqueous zinc and a number of other battery systems currently investigated for energy storage applications. However, the precise mechanism by which they regulate morphology and suppress dendrite formation remains unclear. In this study, we address this knowledge gap by using in-situ electrochemical atomic force microscopy (EC-AFM) to directly observe the interfacial evolution during Zn electrodeposition and polymer adsorption on copper (Cu) substrates in the presence of varying concentrations of ZnSO$_4$ and PEO. Contrary to previous literature assumptions, which emphasize binding to the growing Zn crystal surfaces or Zn$^{2+}$ ions, our results demonstrate that PEO smooths Zn films by promoting nucleation of (002)-oriented Zn platelets through interactions with the Cu substrate. Density functional theory (DFT) simulations support this finding by showing that PEO adsorption on Cu modifies the interfacial energy of Zn/Cu/electrolyte interfaces, favoring the stabilization of Zn (002) on the Cu substrate, and confines Zn electrodeposition to a narrow near-surface region. These findings elucidate a novel design principle for electrode smoothing, emphasizing the importance of substrate selection paired with polymer additives that exhibit an attractive interaction with the substrate, but minimal interaction with growing crystals, offering a mechanistic perspective for improved battery performance.

Submitted 16 February, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

arXiv:2502.05256 [pdf, other]

Learned Offline Query Planning via Bayesian Optimization

Authors: Jeffrey Tao, Natalie Maus, Haydn Jones, Yimeng Zeng, Jacob R. Gardner, Ryan Marcus

Abstract: Analytics database workloads often contain queries that are executed repeatedly. Existing optimization techniques generally prioritize keeping optimization cost low, normally well below the time it takes to execute a single instance of a query. If a given query is going to be executed thousands of times, could it be worth investing significantly more optimization time? In contrast to traditional online query optimizers, we propose an offline query optimizer that searches a wide variety of plans and incorporates query execution as a primitive. Our offline query optimizer combines variational auto-encoders with Bayesian optimization to find optimized plans for a given query. We compare our technique to the optimal plans possible with PostgreSQL and recent RL-based systems over several datasets, and show that our technique finds faster query plans.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.02339 [pdf, ps, other]

Boosting Multimodal Reasoning with Automated Structured Thinking

Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Fangrui Lv, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao

Abstract: Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. Current approaches aim to incorporate structured thinking via two strategies: explicit search methods and post-training techniques. However, both approaches face significant limitations: search-based methods suffer from computational inefficiency due to extensive solution space exploration, while post-training methods require substantial data, computational resources, and often encounter training instability. To address these limitations, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning. Our method introduces "thought cards", a lightweight library of high-level reasoning patterns abstracted from 500 prior samples using Monte Carlo Tree Search. For each test problem, AStar adaptively retrieves the optimal thought cards and seamlessly integrates these external explicit guidelines with the model's internal implicit reasoning capabilities. Extensive experiments demonstrate AStar's effectiveness and efficiency: using only 500 prior samples and a 7B backbone, our training-free framework achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%) and 32.7% on MathVision (versus GPT-4o's 30.4%). Further analysis reveals that AStar generalizes beyond multimodal reasoning to visual perception and understanding domains, and serves as a plug-and-play test-time inference method compatible with mainstream post-training techniques like GRPO.

Submitted 30 May, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

arXiv:2501.17905 [pdf, other]

DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

Authors: Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che

Abstract: Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the potential to reduce model size through pruning techniques. However, existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned components still contain valuable information, their direct removal often leads to irreversible performance degradation, imposing a substantial computational burden to recover performance during finetuning. In this paper, we propose a novel paradigm that first applies regularization, then prunes, and finally finetunes. Based on this paradigm, we introduce DReSS, a simple and effective Data-driven Regularized Structured Streamlining method for LLMs. By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance. Compared to direct pruning, this can reduce the information loss caused by parameter removal, thereby enhancing its language modeling capabilities. Experimental results demonstrate that DReSS significantly outperforms existing pruning methods even under extreme pruning ratios, significantly reducing latency and increasing throughput.

Submitted 9 February, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

arXiv:2501.16566 [pdf, other]

AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models

Authors: Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, Jianhua Tao

Abstract: The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date (by far), featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results show AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.

Submitted 7 May, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.15269 [pdf, other]

Mirage in the Eyes: Hallucination Attack on Multi-modal Large Language Models with Only Attention Sink

Authors: Yining Wang, Mi Zhang, Junjie Sun, Chenyue Wang, Min Yang, Hui Xue, Jialing Tao, Ranjie Duan, Jiexi Liu

Abstract: Fusing visual understanding into language generation, Multi-modal Large Language Models (MLLMs) are revolutionizing visual-language applications. Yet, these models are often plagued by the hallucination problem, which involves generating inaccurate objects, attributes, and relationships that do not match the visual content. In this work, we delve into the internal attention mechanisms of MLLMs to reveal the underlying causes of hallucination, exposing the inherent vulnerabilities in the instruction-tuning process. We propose a novel hallucination attack against MLLMs that exploits attention sink behaviors to trigger hallucinated content with minimal image-text relevance, posing a significant threat to critical downstream applications. Distinguished from previous adversarial methods that rely on fixed patterns, our approach generates dynamic, effective, and highly transferable visual adversarial inputs, without sacrificing the quality of model responses. Comprehensive experiments on 6 prominent MLLMs demonstrate the efficacy of our attack in compromising black-box MLLMs even with extensive mitigating mechanisms, as well as the promising results against cutting-edge commercial APIs, such as GPT-4o and Gemini 1.5. Our code is available at https://huggingface.co/RachelHGF/Mirage-in-the-Eyes.

Submitted 25 January, 2025; originally announced January 2025.

Comments: USENIX Security 2025

arXiv:2501.15044 [pdf, other]

Signal Whisperers: Enhancing Wireless Reception Using DRL-Guided Reflector Arrays

Authors: Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin

Abstract: This paper presents a novel approach for enhancing wireless signal reception through self-adjustable metallic surfaces, termed reflectors, which are guided by deep reinforcement learning (DRL). The designed reflector system aims to improve signal quality for multiple users in scenarios where a direct line-of-sight (LOS) from the access point (AP) and reflector to users is not guaranteed. Utilizing DRL techniques, the reflector autonomously modifies its configuration to optimize beam allocation from the AP to user equipment (UE), thereby maximizing path gain. Simulation results indicate substantial improvements in the average path gain for all UEs compared to baseline configurations, highlighting the potential of DRL-driven reflectors in creating adaptive communication environments.

Submitted 20 May, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

arXiv:2501.06869 [pdf, other]

A Foundational Generative Model for Breast Ultrasound Image Analysis

Authors: Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Haotian Ye, Siyu He, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, James Zou, Qingli Zhu, Yong Wang, Liwei Wang

Abstract: Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential for breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value<0.0001). Additionally, we characterized the scaling effect of using generated data, which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, making progress in secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.

Submitted 12 January, 2025; originally announced January 2025.

Comments: Peking University; Stanford University; Peking University Cancer Hospital & Institute; Peking Union Medical College Hospital; Cancer Hospital, Chinese Academy of Medical Sciences

arXiv:2501.06764 [pdf, other]

MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection

Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li

Abstract: Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. Significant differences in form and content of multimodal information lead to intensified optimization conflicts, hindering effective model training as well as reducing the effectiveness of existing bimodal fusion methods. To address this problem, we propose the MTPareto framework to optimize multimodal fusion, using a Targeted Pareto (TPareto) optimization algorithm for fusion-level-specific objective learning with a certain focus. Based on the designed hierarchical fusion network, the algorithm defines three fusion levels with corresponding losses and implements all-modal-oriented Pareto gradient integration for each. This approach accomplishes superior multimodal fusion by utilizing the information obtained from intermediate fusion to provide positive effects to the entire process. Experimental results on the FakeSV and FVC datasets show that the proposed framework outperforms baselines and that the TPareto optimization algorithm achieves 2.40% and 1.89% accuracy improvements, respectively.

Submitted 24 January, 2025; v1 submitted 12 January, 2025; originally announced January 2025.

arXiv:2501.04931 [pdf, other]

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Authors: Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Then we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack clearly improves the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.

Submitted 8 January, 2025; originally announced January 2025.

arXiv:2412.19099 [pdf, other]

BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

Authors: Cunhang Fan, Enrui Liu, Andong Li, Jianhua Tao, Jian Zhou, Jiahao Li, Chengshi Zheng, Zhao Lv

Abstract: Although the complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we propose a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we design a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.

Submitted 26 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025

arXiv:2412.15517 [pdf, other]

Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning

Authors: Yangkun Chen, Kai Yang, Jian Tao, Jiafei Lyu

Abstract: Recently, deep Multi-Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent's current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi-Agent Novelty-GuidEd sample Reuse (MANGER). This method increases sample efficiency and promotes exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super-hard StarCraft II micromanagement tasks.

Submitted 19 December, 2024; originally announced December 2024.

Comments: AAAI 2025

arXiv:2412.12759 [pdf, other]

Versatile Ordering Network: An Attention-based Neural Network for Ordering Across Scales and Quality Metrics

Authors: Zehua Yu, Weihan Zhang, Sihan Pan, Jun Tao

Abstract: Ordering has been extensively studied in many visualization applications, such as axis and matrix reordering, for the simple reason that the order will greatly impact the perceived pattern of data. Many quality metrics concerning data pattern, perception, and aesthetics are proposed, and respective optimization algorithms are developed. However, the optimization problems related to ordering are often difficult to solve (e.g., TSP is NP-complete), and developing specialized optimization algorithms is costly. In this paper, we propose Versatile Ordering Network (VON), which automatically learns the strategy to order given a quality metric. VON uses the quality metric to evaluate its solutions, and leverages reinforcement learning with a greedy rollout baseline to improve itself. This keeps the metric transparent and allows VON to optimize over different metrics. Additionally, VON uses the attention mechanism to collect information across scales and reposition the data points with respect to the current context. This allows VON to deal with data points following different distributions. We examine the effectiveness of VON under different usage scenarios and metrics. The results demonstrate that VON can produce comparable results to specialized solvers. The code is available at https://github.com/sysuvis/VON.

Submitted 18 December, 2024; v1 submitted 17 December, 2024; originally announced December 2024.

Comments: has been accepted by TVCG on 11-Dec-2024

MSC Class: I.2.6

arXiv:2412.11551 [pdf, other]

Region-Based Optimization in Continual Learning for Audio Deepfake Detection

Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang

Abstract: Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure important neuron regions for real and fake audio detection, dividing them into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use sample proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the increase of redundant neurons from old tasks, we further introduce the Ebbinghaus forgetting mechanism to release them, thereby promoting the capability of the model to learn more generalized discriminative features. Experimental results show our method achieves a 21.3% improvement in EER over the state-of-the-art continual learning approach RWM for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond the audio deepfake detection domain, showing potential significance in other tasks, such as image recognition. The code is available at https://github.com/cyjie429/RegO

Submitted 16 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025

Showing 1–50 of 547 results for author: Tao, J