Showing 1–50 of 295 results for author: Tao, J

Searching in archive cs.
  1. arXiv:2507.04788  [pdf, ps, other]

    cs.LG

    Machine Learning from Explanations

    Authors: Jiashu Tao, Reza Shokri

    Abstract: Acquiring and training on large-scale labeled data can be impractical due to cost constraints. Additionally, the use of small training datasets can result in considerable variability in model outcomes, overfitting, and learning of spurious correlations. A crucial shortcoming of data labels is their lack of any reasoning behind a specific label assignment, causing models to learn any arbitrary clas…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: ICML 2025 AIW

  2. arXiv:2507.04278  [pdf, ps, other]

    cs.HC

    DMER-Ranker: Learning to Rank Emotion Descriptions in the Absence of Ground Truth

    Authors: Zheng Lian, Licai Sun, Haoyu Chen, Zebang Cheng, Fan Zhang, Ziyu Jia, Ziyang Ma, Fei Ma, Xiaojiang Peng, Jianhua Tao

    Abstract: With the recent success of Large Language Models (LLMs), Descriptive Multimodal Emotion Recognition (DMER) has garnered increasing attention, which aims to describe a person's emotional state using free-form natural language. Unlike traditional discriminative methods that rely on predefined emotion taxonomies, DMER offers greater flexibility in emotional expression, enabling fine-grained and inter…

    Submitted 9 July, 2025; v1 submitted 6 July, 2025; originally announced July 2025.

  3. arXiv:2506.21557  [pdf, ps, other]

    cs.CL

    Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

    Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu

    Abstract: The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection (DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collab…

    Submitted 11 June, 2025; originally announced June 2025.

  4. arXiv:2506.18781  [pdf, ps, other]

    cs.CL

    Existing LLMs Are Not Self-Consistent For Simple Tasks

    Authors: Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art mode…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 10 pages, 6 figures

  5. arXiv:2506.14396  [pdf, ps, other]

    cs.SD cs.MM

    Manipulated Regions Localization For Partially Deepfake Audio: A Survey

    Authors: Jiayi He, Jiangyan Yi, Jianhua Tao, Siding Zeng, Hao Gu

    Abstract: With the development of audio deepfake techniques, attacks with partially deepfake audio are beginning to rise. Compared to fully deepfake, it is much harder to be identified by the detector due to the partially cryptic manipulation, resulting in higher security risks. Although some studies have been launched, there is no comprehensive review to systematically introduce the current situations and…

    Submitted 6 July, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

  6. arXiv:2506.07935  [pdf, ps, other]

    cs.MA cs.AI cs.GT

    Diffusion of Responsibility in Collective Decision Making

    Authors: Pavel Naumov, Jia Tao

    Abstract: The term "diffusion of responsibility" refers to situations in which multiple agents share responsibility for an outcome, obscuring individual accountability. This paper examines this frequently undesirable phenomenon in the context of collective decision-making mechanisms. The work shows that if a decision is made by two agents, then the only way to avoid diffusion of responsibility is for one…

    Submitted 9 June, 2025; originally announced June 2025.

  7. arXiv:2506.03880  [pdf, ps, other]

    cs.CL cs.AI

    RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

    Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao

    Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the charact…

    Submitted 4 June, 2025; originally announced June 2025.

  8. arXiv:2506.02931  [pdf, ps, other]

    cs.MA cs.AI cs.LG

    ThinkTank: A Framework for Generalizing Domain-Specific AI Agent Systems into Universal Collaborative Intelligence Platforms

    Authors: Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao

    Abstract: This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem-solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration method…

    Submitted 3 June, 2025; originally announced June 2025.

  9. Event-based multi-view photogrammetry for high-dynamic, high-velocity target measurement

    Authors: Taihang Lei, Banglei Guan, Minzu Liang, Xiangyu Li, Jianbing Liu, Jing Tao, Yang Shang, Qifeng Yu

    Abstract: The characterization of mechanical properties for high-dynamic, high-velocity target motion is essential in industries. It provides crucial data for validating weapon systems and precision manufacturing processes etc. However, existing measurement methods face challenges such as limited dynamic range, discontinuous observations, and high costs. This paper presents a new approach leveraging an even…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: 9 pages, 9 figures, 1 table. This paper was accepted by Acta Mechanica Sinica (Date: 30 May 2025)

  10. arXiv:2506.00375  [pdf, ps, other]

    cs.SD eess.AS

    RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection

    Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li

    Abstract: Existing methods for deepfake audio detection have demonstrated some effectiveness. However, they still face challenges in generalizing to new forgery techniques and evolving attack patterns. This limitation mainly arises because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additiona…

    Submitted 31 May, 2025; originally announced June 2025.

  11. arXiv:2505.18232  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Two-Stage Regularization-Based Structured Pruning for LLMs

    Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao

    Abstract: The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method TRSP: Two-Stage…

    Submitted 30 June, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  12. arXiv:2505.15692  [pdf, other]

    cs.CL cs.LG

    Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

    Authors: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao

    Abstract: Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propo…

    Submitted 26 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  13. arXiv:2505.15210  [pdf, other]

    cs.CL cs.IR

    Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

    Authors: Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su

    Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLM…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Under Review

    ACM Class: I.2.4

  14. arXiv:2505.14135  [pdf, other]

    cs.CV

    Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

    Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, et al. (33 additional authors not shown)

    Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simult…

    Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  15. arXiv:2505.11770  [pdf, ps, other]

    cs.LG cs.AI cs.CL stat.ML

    Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

    Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

    Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: ICML 2025

  16. arXiv:2505.11733  [pdf, ps, other]

    cs.CL

    MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

    Authors: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou

    Abstract: Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final ans…

    Submitted 20 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  17. arXiv:2505.11079  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

    Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

    Abstract: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot…

    Submitted 8 July, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

    Comments: Accepted by ACMMM 2025

  18. arXiv:2505.11044  [pdf, ps, other]

    cs.LG

    Exploration by Random Distribution Distillation

    Authors: Zhirui Fang, Kai Yang, Jian Tao, Jiafei Lyu, Lusong Li, Li Shen, Xiu Li

    Abstract: Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Currently, the main exploration algorithms are primarily count-based methods and curiosity-based methods, with prediction-error methods being a prominent example. In this paper, we propose a novel method called Random Distri…

    Submitted 16 May, 2025; originally announced May 2025.

  19. arXiv:2505.06312  [pdf, other]

    cs.GT cs.AI

    Responsibility Gap in Collective Decision Making

    Authors: Pavel Naumov, Jia Tao

    Abstract: The responsibility gap is a set of outcomes of a collective decision-making mechanism in which no single agent is individually responsible. In general, when designing a decision-making process, it is desirable to minimise the gap. The paper proposes a concept of an elected dictatorship. It shows that, in a perfect information setting, the gap is empty if and only if the mechanism is an elected d…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: full version of an IJCAI-25 paper

  20. arXiv:2504.19423  [pdf, other]

    cs.HC

    MER 2025: When Affective Computing Meets Large Language Models

    Authors: Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, Ziyang Ma, Xiaojiang Peng, Xie Chen, Ya Li, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao

    Abstract: MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER20…

    Submitted 29 April, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

  21. arXiv:2504.12395  [pdf, other]

    cs.CV

    InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

    Authors: Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu

    Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character cus…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: Tech Report. Code is available at https://github.com/Tencent/InstantCharacter

  22. arXiv:2504.05197  [pdf, other]

    cs.SD

    P2Mark: Plug-and-play Parameter-level Watermarking for Neural Speech Generation

    Authors: Yong Ren, Jiangyan Yi, Tao Wang, Jianhua Tao, Zheng Lian, Zhengqi Wen, Chenxing Li, Ruibo Fu, Ye Bai, Xiaohui Zhang

    Abstract: Neural speech generation (NSG) has rapidly advanced as a key component of artificial intelligence-generated content, enabling the generation of high-quality, highly realistic speech for diverse applications. This development increases the risk of technique misuse and threatens social security. Audio watermarking can embed imperceptible marks into generated audio, providing a promising approach for…

    Submitted 5 May, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  23. arXiv:2504.00487  [pdf, other]

    cs.MM cs.CL cs.CV

    FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

    Authors: Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su

    Abstract: Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challen…

    Submitted 2 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: Under Review

    ACM Class: H.5.1; I.2.4

  24. arXiv:2503.22724  [pdf, other]

    cs.LG

    A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation

    Authors: Haonan Shi, Long Tian, Jie Tao, Yufei Li, Liming Wang, Xiyang Liu

    Abstract: Hail nowcasting is a considerable contributor to meteorological disasters and there is a great need to mitigate its socioeconomic effects through precise forecast that has high resolution, long lead times and local details with large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events, such a…

    Submitted 26 March, 2025; originally announced March 2025.

  25. arXiv:2503.18246  [pdf, other]

    eess.IV cs.CV

    ZECO: ZeroFusion Guided 3D MRI Conditional Generation

    Authors: Feiran Wang, Bin Duan, Jiachen Tao, Nikhil Sharma, Dawen Cai, Yan Yan

    Abstract: Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generat…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: Project page: https://brack-wang.github.io/ZECO_web/; Github Code: https://github.com/Brack-Wang/ZECO

  26. arXiv:2503.14359  [pdf, other]

    cs.CV

    ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

    Authors: Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

    Abstract: User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos,…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  27. arXiv:2503.13991  [pdf, other]

    cs.CV cs.AI

    GraphTEN: Graph Enhanced Texture Encoding Network

    Authors: Bo Peng, Jintao Chen, Mufeng Yao, Chenhao Zhang, Jianghui Zhang, Mingmin Chi, Jiang Tao

    Abstract: Texture recognition is a fundamental problem in computer vision and pattern recognition. Recent progress leverages feature aggregation into discriminative descriptions based on convolutional neural networks (CNNs). However, modeling non-local context relations through visual primitives remains challenging due to the variability and randomness of texture primitives in spatial distributions. In this…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 6 pages, 7 figures, conference paper

    MSC Class: 68T45

    ACM Class: I.2.10; I.4.7

  28. arXiv:2503.09962  [pdf, other]

    cs.CV

    Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

    Authors: Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu

    Abstract: Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: CVPR 2025. Project website: https://github.com/sssaury/HAM

  29. arXiv:2503.08596  [pdf, other]

    cs.CV

    X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction

    Authors: Feiran Wang, Jiachen Tao, Junyi Wu, Haoxuan Wang, Bin Duan, Kai Wang, Zongxin Yang, Yan Yan

    Abstract: X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to potential health risks. To mitigate radiation exposure, recent research focuses on generating novel views from sparse inputs and reconstructing Computed Tomography (CT) volumes, borrowing representations from the 3D reconstruction area. However, these representations originally target visible light imagi…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Project Page: https://brack-wang.github.io/XField/, Github Code: https://github.com/Brack-Wang/X-Field

  30. arXiv:2503.08131  [pdf, ps, other]

    cs.LG

    Large Scale Multi-Task Bayesian Optimization with Large Language Models

    Authors: Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Ryan Marcus, Osbert Bastani, Jacob R. Gardner

    Abstract: In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling beyond a moderate number of tasks. We introduce a novel approach leveraging large language models (LLMs) to le…

    Submitted 12 June, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  31. arXiv:2502.18549  [pdf, ps, other]

    cs.LG cs.CR

    ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense

    Authors: Jiyue Tao, Tongsheng Shen, Dexin Zhao, Feitian Zhang

    Abstract: The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs. A particularly challenging scenario arises when the attacker exhibits superior maneuverability compared to the defenders, significantly complicating effective interception. To tackle this challenge, this lett…

    Submitted 10 July, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  32. arXiv:2502.17475  [pdf, other]

    eess.SP cs.AI cs.CL cs.LG

    ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis

    Authors: Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, Jie Tao

    Abstract: We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to comple…

    Submitted 7 April, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

  33. arXiv:2502.13917  [pdf, ps, other]

    cs.CL

    TESS 2: A Large-Scale Generalist Diffusion Language Model

    Authors: Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan

    Abstract: We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We fin…

    Submitted 31 May, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

    Comments: ACL 2025 camera-ready

  34. arXiv:2502.05256  [pdf, other]

    cs.DB

    Learned Offline Query Planning via Bayesian Optimization

    Authors: Jeffrey Tao, Natalie Maus, Haydn Jones, Yimeng Zeng, Jacob R. Gardner, Ryan Marcus

    Abstract: Analytics database workloads often contain queries that are executed repeatedly. Existing optimization techniques generally prioritize keeping optimization cost low, normally well below the time it takes to execute a single instance of a query. If a given query is going to be executed thousands of times, could it be worth investing significantly more optimization time? In contrast to traditional o…

    Submitted 7 February, 2025; originally announced February 2025.

  35. arXiv:2502.02339  [pdf, ps, other]

    cs.CL

    Boosting Multimodal Reasoning with Automated Structured Thinking

    Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Fangrui Lv, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao

    Abstract: Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. Current approaches aim to incorporate structured thinking via two strategies: explicit search methods and post-training techniques. However, both approaches face significant limitations: Search-based methods suffer from computational inefficiency due to extensive solution space explorati…

    Submitted 30 May, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

  36. arXiv:2501.17905  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

    Authors: Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che

    Abstract: Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the potential to reduce model size through pruning techniques. However, existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned compone…

    Submitted 28 June, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

  37. arXiv:2501.16566  [pdf, other]

    cs.HC

    AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models

    Authors: Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, Jianhua Tao

    Abstract: The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as…

    Submitted 7 May, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

  38. arXiv:2501.15269  [pdf, other]

    cs.LG cs.CR cs.CV

    Mirage in the Eyes: Hallucination Attack on Multi-modal Large Language Models with Only Attention Sink

    Authors: Yining Wang, Mi Zhang, Junjie Sun, Chenyue Wang, Min Yang, Hui Xue, Jialing Tao, Ranjie Duan, Jiexi Liu

    Abstract: Fusing visual understanding into language generation, Multi-modal Large Language Models (MLLMs) are revolutionizing visual-language applications. Yet, these models are often plagued by the hallucination problem, which involves generating inaccurate objects, attributes, and relationships that do not match the visual content. In this work, we delve into the internal attention mechanisms of MLLMs to…

    Submitted 25 January, 2025; originally announced January 2025.

    Comments: USENIX Security 2025

  39. arXiv:2501.06869  [pdf, other]

    cs.AI cs.CV cs.HC cs.LG

    A Foundational Generative Model for Breast Ultrasound Image Analysis

    Authors: Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Haotian Ye, Siyu He, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, James Zou, Qingli Zhu, Yong Wang, Liwei Wang

    Abstract: Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential development to breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired ex…

    Submitted 12 January, 2025; originally announced January 2025.

    Comments: Peking University; Stanford University; Peking University Cancer Hospital & Institute; Peking Union Medical College Hospital; Cancer Hospital, Chinese Academy of Medical Sciences

  40. arXiv:2501.06764  [pdf, other]

    cs.LG

    MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection

    Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li

    Abstract: Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. Significant differences in form and content of multimodal information lead to intensified optimization conflicts, hindering effective model training as well as reducing the effectiveness of existing fusion methods for bimodal. To address this problem, we propose the MTPareto framework t…

    Submitted 24 January, 2025; v1 submitted 12 January, 2025; originally announced January 2025.

  41. arXiv:2501.04931  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

    Authors: Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei

    Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism t…

    Submitted 27 June, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: ICCV2025

  42. arXiv:2412.19099  [pdf, other]

    cs.SD eess.AS

    BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

    Authors: Cunhang Fan, Enrui Liu, Andong Li, Jianhua Tao, Jian Zhou, Jiahao Li, Chengshi Zheng, Zhao Lv

    Abstract: Although the complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that lim…

    Submitted 26 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  43. arXiv:2412.15517  [pdf, other]

    cs.LG

    Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning

    Authors: Yangkun Chen, Kai Yang, Jian Tao, Jiafei Lyu

    Abstract: Recently, deep Multi-Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: AAAI 2025

  44. arXiv:2412.12759  [pdf, other]

    cs.LG

    Versatile Ordering Network: An Attention-based Neural Network for Ordering Across Scales and Quality Metrics

    Authors: Zehua Yu, Weihan Zhang, Sihan Pan, Jun Tao

    Abstract: Ordering has been extensively studied in many visualization applications, such as axis and matrix reordering, for the simple reason that the order will greatly impact the perceived pattern of data. Many quality metrics concerning data pattern, perception, and aesthetics are proposed, and respective optimization algorithms are developed. However, the optimization problems related to ordering are of…

    Submitted 18 December, 2024; v1 submitted 17 December, 2024; originally announced December 2024.

    Comments: Accepted by TVCG on 11 December 2024

    MSC Class: I.2.6

  45. arXiv:2412.11551  [pdf, other]

    cs.SD cs.AI eess.AS

    Region-Based Optimization in Continual Learning for Audio Deepfake Detection

    Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang

    Abstract: Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based O…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  46. arXiv:2412.06324  [pdf, other]

    cs.CV

    World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

    Authors: Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia

    Abstract: The Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conce…

    Submitted 1 January, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

    Comments: AAAI 2025. 14 pages. Supplementary Material

  47. arXiv:2412.03520  [pdf, other]

    cs.CV

    Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

    Authors: Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao

    Abstract: Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-m…

    Submitted 9 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

  48. arXiv:2412.01425  [pdf, other]

    cs.SD cs.AI eess.AS

    Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio

    Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Yujie Chen, Hao Gu, Guanjun Li, Junzuo Zhou, Yong Ren, Tao Xu

    Abstract: Open environment oriented open set model attribution of deepfake audio is an emerging research topic, aiming to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare with predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, threshol…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: Accepted by ISCSLP 2024

  49. arXiv:2411.18478  [pdf, ps, other]

    cs.CL

    Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

    Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Chonghua Liao, Jianhua Tao

    Abstract: In-context learning (ICL) enables large language models (LLMs) to perform downstream tasks through advanced prompting and high-quality demonstrations. However, traditional ICL paradigms encounter significant limitations in complex reasoning tasks, stemming primarily from their dependence on example quality and absence of explicit reasoning guidance. To address these challenges, we introduce HiAR-I…

    Submitted 2 June, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  50. arXiv:2411.03814  [pdf, other]

    cs.AI cs.CL cs.CR

    MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

    Authors: Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, Cheng Wei, YueFeng Chen, Chongwen Wang, Jialing Tao, Hang Su, Jun Zhu, Hui Xue

    Abstract: Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities, but they have also been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous wor…

    Submitted 7 January, 2025; v1 submitted 6 November, 2024; originally announced November 2024.