Showing 1–50 of 3,089 results for author: huang, Y

Searching in archive cs.
  1. arXiv:2506.18155

    cs.LG

    Probabilistic and reinforced mining of association rules

    Authors: Yongchao Huang

    Abstract: This work introduces 4 novel probabilistic and reinforcement-driven methods for association rule mining (ARM): Gaussian process-based association rule mining (GPAR), Bayesian ARM (BARM), multi-armed bandit based ARM (MAB-ARM), and reinforcement learning based association rule mining (RLAR). These methods depart fundamentally from traditional frequency-based algorithms such as Apriori, FP-Growth, a…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: 205 pages

  2. arXiv:2506.18096

    cs.AI

    Deep Research Agents: A Systematic Examination And Roadmap

    Authors: Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang

    Abstract: The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of struct…

    Submitted 22 June, 2025; originally announced June 2025.

  3. arXiv:2506.17873

    cs.CV cs.AI

    SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

    Authors: Guankun Wang, Wenjin Mo, Junyi Wang, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

    Abstract: Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, facilitating users to understand surgical scenes and procedures. Beyond image-based methods, the exploration of Video Large Language Models (Vid-LLMs) has emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of…

    Submitted 21 June, 2025; originally announced June 2025.

  4. arXiv:2506.17623

    cs.MM cs.CV

    Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?

    Authors: Yuesheng Huang, Peng Zhang, Riliang Liu, Jiaqi Liang

    Abstract: A significant "modality gap" exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as a valuable complementary modality for text-centric tasks. Through a comprehensive evaluation framework on text classification, we analyze the impact of critica…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 4 figures, 7 tables

  5. arXiv:2506.17585

    cs.AI cs.CL cs.LG

    Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

    Authors: Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra

    Abstract: Trustworthy language models should provide both correct and verifiable answers. While language models can sometimes attribute their outputs to pretraining data, their citations are often unreliable due to hallucination. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval no…

    Submitted 21 June, 2025; originally announced June 2025.

  6. arXiv:2506.17457

    cs.CV

    When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network

    Authors: Dong Xiao, Guangyao Chen, Peixi Peng, Yangru Huang, Yifan Zhao, Yongxing Dai, Yonghong Tian

    Abstract: Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynch…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: ICML 2025 Spotlight

  7. arXiv:2506.17206

    cs.GR cs.CV cs.LG

    DreamCube: 3D Panorama Generation via Multi-plane Synchronization

    Authors: Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, Xihui Liu

    Abstract: 3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Project page: https://yukun-huang.github.io/DreamCube/

  8. arXiv:2506.16398

    cs.CV

    HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis

    Authors: Peixiang Huang, Yanyan Huang, Weiqin Zhao, Junjun He, Lequan Yu

    Abstract: Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy -- patches, regions, and slides -- with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully captur…

    Submitted 19 June, 2025; originally announced June 2025.

  9. arXiv:2506.16336

    cs.RO cs.MA

    Goal-conditioned Hierarchical Reinforcement Learning for Sample-efficient and Safe Autonomous Driving at Intersections

    Authors: Yiou Huang

    Abstract: Reinforcement learning (RL) exhibits remarkable potential in addressing autonomous driving tasks. However, it is difficult to train a sample-efficient and safe policy in complex scenarios. In this article, we propose a novel hierarchical reinforcement learning (HRL) framework with a goal-conditioned collision prediction (GCCP) module. In the hierarchical structure, the GCCP module predicts collisi…

    Submitted 19 June, 2025; originally announced June 2025.

  10. arXiv:2506.16250

    quant-ph cs.IT

    Graph-Cover-based Characterization of the Bethe Partition Function of Double-Edge Factor Graphs

    Authors: Yuwen Huang, Pascal O. Vontobel

    Abstract: For standard factor graphs (S-FGs) with non-negative real-valued local functions, Vontobel provided a combinatorial characterization of the Bethe approximation of the partition function, also known as the Bethe partition function, using finite graph covers. The proof of this characterization, i.e., the graph-cover theorem for S-FGs, heavily relied on the method of types. In this paper, we study…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2412.05942

  11. arXiv:2506.16211

    cs.RO

    ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

    Authors: Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang

    Abstract: Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Website: https://controlvla.github.io

  12. arXiv:2506.15755

    cs.CV cs.CL

    VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service

    Authors: Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei YE, Kuofeng Gao, Yi Huang, Yuan Yao

    Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unr…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025

  13. arXiv:2506.15218

    cs.CV

    DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder

    Authors: Dan He, Weisheng Li, Guofen Wang, Yuping Huang, Shiqiang Liu

    Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the t…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted by IEEE Transactions on Multimedia (TMM) in March 2025

  14. arXiv:2506.14973

    eess.AS cs.AI

    Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

    Authors: Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze

    Abstract: Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone a…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  15. arXiv:2506.14015

    cs.CV

    Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation

    Authors: Nick Yiwen Huang, Akin Caliskan, Berkay Kicanaoglu, James Tompkin, Hyeongwoo Kim

    Abstract: We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D datas…

    Submitted 16 June, 2025; originally announced June 2025.

  16. arXiv:2506.13833

    cs.SD cs.AI cs.RO eess.AS physics.app-ph

    A Survey on World Models Grounded in Acoustic Physical Information

    Authors: Xiaoliang Chen, Le Chang, Xin Yu, Yunhe Huang, Xianling Tu

    Abstract: This survey provides a comprehensive overview of the emerging field of world models grounded in the foundation of acoustic physical information. It examines the theoretical underpinnings, essential methodological frameworks, and recent technological advancements in leveraging acoustic signals for high-fidelity environmental perception, causal physical reasoning, and predictive simulation of dynami…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 28 pages, 11 equations

    MSC Class: 68T07; 35L05; 78A45 ACM Class: I.2.6; H.5.5; I.2.9

  17. arXiv:2506.13725

    cs.RO

    CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

    Authors: Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li

    Abstract: In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi de…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 16 pages

  18. arXiv:2506.13585

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  19. arXiv:2506.13387

    cs.CV

    TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

    Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren

    Abstract: This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downst…

    Submitted 16 June, 2025; originally announced June 2025.

  20. arXiv:2506.12909

    cs.CL

    SciDA: Scientific Dynamic Assessor of LLMs

    Authors: Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang

    Abstract: Advancement in Large Language Models (LLMs) reasoning capabilities enables them to solve scientific problems with enhanced efficacy. Thereby, a high-quality benchmark for comprehensive and appropriate assessment holds significance, while existing ones either confront the risk of data contamination or lack involved disciplines. To be specific, due to the data source overlap of LLMs training and sta…

    Submitted 15 June, 2025; originally announced June 2025.

  21. arXiv:2506.12517

    cs.CV

    Retrieval Augmented Comic Image Generation

    Authors: Yunhao Shui, Xuekuan Wang, Feng Qiu, Yuqiu Huang, Jinzhu Li, Haoyu Zheng, Jinru Han, Zhuo Zeng, Pengpeng Zhang, Jiarui Han, Keqiang Sun

    Abstract: We present RaCig, a novel system for generating comic-style image sequences with consistent characters and expressive gestures. RaCig addresses two key challenges: (1) maintaining character identity and costume consistency across frames, and (2) producing diverse and vivid character gestures. Our approach integrates a retrieval-based character assignment module, which aligns characters in textual…

    Submitted 14 June, 2025; originally announced June 2025.

  22. arXiv:2506.12270

    cs.AI cs.HC cs.LG eess.SY

    Cloud Infrastructure Management in the Age of AI Agents

    Authors: Zhenning Yang, Archit Bhatnagar, Yiming Qiu, Tongyuan Miao, Patrick Tser Jern Kon, Yunming Xiao, Yibo Huang, Martin Casado, Ang Chen

    Abstract: Cloud infrastructure is the cornerstone of the modern IT industry. However, managing this infrastructure effectively requires considerable manual effort from the DevOps engineering team. We make a case for developing AI agents powered by large language models (LLMs) to automate cloud infrastructure management tasks. In a preliminary study, we investigate the potential for AI agents to use differen…

    Submitted 13 June, 2025; originally announced June 2025.

  23. arXiv:2506.11104

    cs.CL cs.AI

    DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

    Authors: Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng

    Abstract: Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-s…

    Submitted 6 June, 2025; originally announced June 2025.

  24. arXiv:2506.11073

    cs.CL cs.AI cs.CV

    CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

    Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensiv…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: ACL 2025 Main

  25. arXiv:2506.10981

    cs.CV

    SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

    Authors: Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan

    Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry…

    Submitted 12 June, 2025; originally announced June 2025.

  26. arXiv:2506.10962

    cs.CV cs.AI cs.LG

    SpectralAR: Spectral Autoregressive Visual Generation

    Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu

    Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a S…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Project Page: https://huang-yh.github.io/spectralar/

  27. arXiv:2506.10887

    cs.CL cs.LG

    Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

    Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

    Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoni…

    Submitted 12 June, 2025; originally announced June 2025.

  28. arXiv:2506.10465

    cs.CV

    MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

    Authors: Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang, Wei Shen

    Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segment…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: †: Equal contribution

  29. arXiv:2506.10335

    cs.CV

    PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting

    Authors: Lintao Xiang, Hongpei Zheng, Yating Huang, Qijun Yang, Hujun Yin

    Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training…

    Submitted 12 June, 2025; originally announced June 2025.

  30. Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective

    Authors: Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng

    Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient considerati…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted by IEEE Transactions on Medical Imaging

  31. arXiv:2506.10027

    cs.GR cs.CV cs.LG

    Learning-based density-equalizing map

    Authors: Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

    Abstract: Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or opt…

    Submitted 9 June, 2025; originally announced June 2025.

  32. arXiv:2506.09663

    cs.CV

    Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation

    Authors: Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang

    Abstract: Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deforma…

    Submitted 11 June, 2025; originally announced June 2025.

  33. arXiv:2506.08967

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  34. arXiv:2506.08844

    cs.LG cs.CE

    IMAGIC-500: IMputation benchmark on A Generative Imaginary Country (500k samples)

    Authors: Siyi Sun, David Antony Selby, Yunchuan Huang, Sebastian Vollmer, Seth Flaxman, Anisoara Calinescu

    Abstract: Missing data imputation in tabular datasets remains a pivotal challenge in data science and machine learning, particularly within socioeconomic research. However, real-world socioeconomic datasets are typically subject to strict data protection protocols, which often prohibit public sharing, even for synthetic derivatives. This severely limits the reproducibility and accessibility of benchmark stu…

    Submitted 10 June, 2025; originally announced June 2025.

  35. arXiv:2506.08473

    cs.LG

    AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

    Authors: Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

    Abstract: Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations alon…

    Submitted 10 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

  36. arXiv:2506.08365

    cs.LG q-bio.BM

    AlphaFold Database Debiasing for Robust Inverse Folding

    Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Siyuan Li, Yufei Huang, Stan Z. Li

    Abstract: The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near-experimental accuracy, positioning it as a valuable resource for data-driven protein design. However, its direct use in training deep models that are sensitive to fine-grained atomic geometry, such as inverse folding, exposes a critical limitation. Comparative analysis of structural feature distribution…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Under review

  37. arXiv:2506.08347

    cs.LG cs.CR

    Differentially Private Relational Learning with Entity-level Privacy Guarantees

    Authors: Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li

    Abstract: Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to…

    Submitted 12 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  38. arXiv:2506.07961

    cs.RO cs.AI

    BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

    Authors: Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan

    Abstract: Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we in…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: In Submission

  39. arXiv:2506.07900

    cs.CL cs.AI

    MiniCPM4: Ultra-Efficient LLMs on End Devices

    Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, et al. (50 additional authors not shown)

    Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelera…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: MiniCPM4 Technical Report

  40. arXiv:2506.07799

    cs.IT

    Learned Off-Grid Imager for Low-Altitude Economy with Cooperative ISAC Network

    Authors: Yixuan Huang, Jie Yang, Shuqiang Xia, Chao-Kai Wen, Shi Jin

    Abstract: The low-altitude economy is emerging as a key driver of future economic growth, necessitating effective flight activity surveillance using existing mobile cellular network sensing capabilities. However, traditional monostatic and localization-based sensing methods face challenges in fusing sensing results and matching channel parameters. To address these challenges, we model low-altitude surveillan…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: submitted to IEEE for possible publication

  41. arXiv:2506.07664

    cs.CL cs.AI

    Synthesis by Design: Controlled Data Generation via Structural Guidance

    Authors: Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

    Abstract: Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and gui…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  42. arXiv:2506.07587  [pdf, ps, other]

    cs.LG cs.AI

    PrunePEFT: Iterative Hybrid Pruning for Parameter-Efficient Fine-tuning of LLMs

    Authors: Tongzhou Yu, Zhuhao Zhang, Guanghui Zhu, Shen Jiang, Meikang Qiu, Yihua Huang

    Abstract: Parameter Efficient Fine-Tuning (PEFT) methods have emerged as effective and promising approaches for fine-tuning pre-trained language models. Compared with Full parameter Fine-Tuning (FFT), PEFT achieved comparable task performance with a substantial reduction of trainable parameters, which largely saved the training and storage costs. However, using the PEFT method requires considering a vast de…

    Submitted 9 June, 2025; originally announced June 2025.

  43. arXiv:2506.07584  [pdf, ps, other]

    cs.LG

    MIRA: Medical Time Series Foundation Model for Real-World Health Data

    Authors: Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian

    Abstract: A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundati…

    Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  44. arXiv:2506.07408  [pdf, ps, other]

    cs.LG cs.AI

    Fractional-order Jacobian Matrix Differentiation and Its Application in Artificial Neural Networks

    Authors: Xiaojun Zhou, Chunna Zhao, Yaqun Huang, Chengli Zhou, Junjie Ye, Kemeng Xiang

    Abstract: Fractional-order differentiation has many characteristics different from integer-order differentiation. These characteristics can be applied to the optimization algorithms of artificial neural networks to obtain better results. However, due to insufficient theoretical research, at present, there is no fractional-order matrix differentiation method that is perfectly compatible with automatic differ…

    Submitted 9 June, 2025; originally announced June 2025.

  45. arXiv:2506.07403  [pdf, ps, other]

    cs.CR

    Enhancing Watermarking Quality for LLMs via Contextual Generation States Awareness

    Authors: Peiru Yang, Xintian Li, Wanchun Ni, Jinhua Yin, Huili Wang, Guoshun Nan, Shangguang Wang, Yongfeng Huang, Tao Qi

    Abstract: Recent advancements in watermarking techniques have enabled the embedding of secret messages into AI-generated text (AIGT), serving as an important mechanism for AIGT detection. Existing methods typically interfere with the generation processes of large language models (LLMs) to embed signals within the generated text. However, these methods often rely on heuristic rules, which can result in subop…

    Submitted 8 June, 2025; originally announced June 2025.

  46. arXiv:2506.07399  [pdf, ps, other]

    cs.CV cs.AI

    MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems

    Authors: Peiru Yang, Jinhua Yin, Haoran Zheng, Xueying Bai, Huili Wang, Yufei Sun, Xintian Li, Shangguang Wang, Yongfeng Huang, Tao Qi

    Abstract: Multimodal retrieval-augmented generation (RAG) systems enhance large vision-language models by integrating cross-modal knowledge, enabling their increasing adoption across real-world multimodal tasks. These knowledge databases may contain sensitive information that requires privacy protection. However, multimodal RAG systems inherently grant external users indirect access to such data, making the…

    Submitted 8 June, 2025; originally announced June 2025.

  47. arXiv:2506.07343  [pdf, ps, other]

    physics.soc-ph cs.SI

    Powers of Magnetic Graph Matrix: Fourier Spectrum, Walk Compression, and Applications

    Authors: Yinan Huang, David F. Gleich, Pan Li

    Abstract: Magnetic graphs, originally developed to model quantum systems under magnetic fields, have recently emerged as a powerful framework for analyzing complex directed networks. Existing research has primarily used the spectral properties of the magnetic graph matrix to study global and stationary network features. However, their capacity to model local, non-equilibrium behaviors, often described by ma…

    Submitted 8 June, 2025; originally announced June 2025.

  48. arXiv:2506.07309  [pdf, other]

    cs.CL

    ConfQA: Answer Only If You Are Confident

    Authors: Yin Huang, Yifan Ethan Xu, Kai Sun, Vera Yan, Alicia Sun, Haidar Khan, Jimmy Nguyen, Mohammad Kachuee, Zhaojiang Lin, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

    Abstract: Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to a…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 10 pages main content, 10 pages appendix, 5 figures, 7 tables

  49. arXiv:2506.06821  [pdf, ps, other]

    cs.CL cs.AI cs.SE

    Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

    Authors: Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a…

    Submitted 10 June, 2025; v1 submitted 7 June, 2025; originally announced June 2025.

    Comments: 37 pages, 22 figures

  50. arXiv:2506.06705  [pdf, ps, other]

    cs.CL cs.AI

    DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains

    Authors: Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng

    Abstract: Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence betw…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: Zhihui Chen and Kai He contributed equally to this work, Mengling Feng is the corresponding author