Showing 1–50 of 729 results for author: Pan, J

  1. arXiv:2505.10557  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

    Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li

    Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inhe…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 Findings

  2. arXiv:2505.10415  [pdf, ps, other]

    cs.RO cs.HC

    Internal State Estimation in Groups via Active Information Gathering

    Authors: Xuebo Ji, Zherong Pan, Xifeng Gao, Lei Yang, Xinxin Du, Kaiyun Li, Yongjin Liu, Wenping Wang, Changhe Tu, Jia Pan

    Abstract: Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in com…

    Submitted 15 May, 2025; originally announced May 2025.

  3. arXiv:2505.07747  [pdf, other]

    cs.CV

    Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

    Authors: Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan

    Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Technical report

  4. arXiv:2505.07538  [pdf, ps, other]

    cs.CV

    Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

    Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

    Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally dist…

    Submitted 12 May, 2025; originally announced May 2025.

  5. arXiv:2505.04073  [pdf]

    cs.CL

    Natural Language Generation in Healthcare: A Review of Methods and Applications

    Authors: Mengxian Lyu, Xiaohan Li, Ziyi Chen, Jinqian Pan, Cheng Peng, Sankalp Talankar, Yonghui Wu

    Abstract: Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities…

    Submitted 6 May, 2025; originally announced May 2025.

  6. Artificial Protozoa Optimizer (APO): A novel bio-inspired metaheuristic algorithm for engineering optimization

    Authors: Xiaopeng Wang, Vaclav Snasel, Seyedali Mirjalili, Jeng-Shyang Pan, Lingping Kong, Hisham A. Shehadeh

    Abstract: This study proposes a novel artificial protozoa optimizer (APO) that is inspired by protozoa in nature. The APO mimics the survival mechanisms of protozoa by simulating their foraging, dormancy, and reproductive behaviors. The APO was mathematically modeled and implemented to perform the optimization processes of metaheuristic algorithms. The performance of the APO was verified via experimental si…

    Submitted 6 May, 2025; originally announced May 2025.

  7. arXiv:2505.02865  [pdf, other]

    cs.CL cs.AI

    Accelerating Large Language Model Reasoning via Speculative Search

    Authors: Zhihai Wang, Jie Wang, Jilai Pan, Xilin Xia, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Feng Wu

    Abstract: Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we pr…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML2025

  8. arXiv:2505.01746  [pdf, other]

    cs.CV

    Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

    Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

    Abstract: Generating gestures from human speech has made tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent gesture modeling with two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling t…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Accepted as ICLR 2025 (Spotlight)

  9. arXiv:2505.01263  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

    Authors: Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang

    Abstract: Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow…

    Submitted 2 May, 2025; originally announced May 2025.

  10. arXiv:2505.00843  [pdf, other]

    cs.CR cs.AI

    OET: Optimization-based prompt injection Evaluation Toolkit

    Authors: Jinsheng Pan, Xiaogeng Liu, Chaowei Xiao

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, enabling their widespread adoption across various domains. However, their susceptibility to prompt injection attacks poses significant security risks, as adversarial inputs can manipulate model behavior and override intended instructions. Despite numerous defense strategies, a s…

    Submitted 1 May, 2025; originally announced May 2025.

  11. arXiv:2505.00675  [pdf, other]

    cs.CL

    Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

    Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan

    Abstract: Memory is a fundamental component of AI systems, underpinning large language models (LLMs) based agents. While prior surveys have focused on memory applications with LLMs, they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six funda…

    Submitted 1 May, 2025; originally announced May 2025.

  12. arXiv:2504.21067  [pdf, other]

    cs.GR cs.CV cs.RO

    GauSS-MI: Gaussian Splatting Shannon Mutual Information for Active 3D Reconstruction

    Authors: Yuhan Xie, Yixi Cai, Yinqiang Zhang, Lei Yang, Jia Pan

    Abstract: This research tackles the challenge of real-time active view selection and uncertainty quantification on visual quality for active 3D reconstruction. Visual quality is a critical aspect of 3D reconstruction. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have notably enhanced the image rendering quality of reconstruction models. Nonetheless, the efficien…

    Submitted 29 April, 2025; originally announced April 2025.

  13. arXiv:2504.15843  [pdf, other]

    cs.CL

    Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

    Authors: Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

    Abstract: Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO ca…

    Submitted 25 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  14. arXiv:2504.15466  [pdf, other]

    cs.AI cs.CL

    Learning Adaptive Parallel Reasoning with Language Models

    Authors: Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr

    Abstract: Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redund…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Code, model, and data are available at https://github.com/Parallel-Reasoning/APR. The first three authors contributed equally to this work

  15. arXiv:2504.15254  [pdf, other]

    cs.SE cs.CL

    CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

    Authors: Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig

    Abstract: C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as…

    Submitted 21 April, 2025; originally announced April 2025.

  16. arXiv:2504.13631  [pdf, other]

    cs.AI

    Multi-modal Knowledge Graph Generation with Semantics-enriched Prompts

    Authors: Yajing Xu, Zhiqiang Liu, Jiaoyan Chen, Mingchen Tu, Zhuo Chen, Jeff Z. Pan, Yichi Zhang, Yushan Zhu, Wen Zhang, Huajun Chen

    Abstract: Multi-modal Knowledge Graphs (MMKGs) have been widely applied across various domains for knowledge representation. However, the existing MMKGs are significantly fewer than required, and their construction faces numerous challenges, particularly in ensuring the selection of high-quality, contextually relevant images for knowledge graph enrichment. To address these challenges, we present a framework…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: Accepted by IJCNN 2025

  17. arXiv:2504.13037  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

    Authors: Yundi Zhang, Paul Hager, Che Liu, Suprosanna Shit, Chen Chen, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographics, metabolic, and lifestyle, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health…

    Submitted 18 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  18. arXiv:2504.12711  [pdf, other]

    cs.CV cs.AI eess.IV

    NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, et al. (112 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…

    Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

  19. arXiv:2504.12613  [pdf, other]

    cs.CE

    Fast and Accurate Prediction of Antenna Reflection Coefficients in Planar Layered Media Environment via Generalized Scattering Matrix

    Authors: Chenbo Shi, Shichen Liang, Xin Gu, Jin Pan

    Abstract: The numerical algorithm for evaluating the reflection coefficient of an antenna in the presence of the planar layered medium is reformulated using the antenna's generalized scattering matrix (GSM). The interaction between the antenna and the layered medium is modeled through spherical-to-planar vector wave transformations, ensuring no approximations that could compromise computational accuracy. Th…

    Submitted 16 April, 2025; originally announced April 2025.

  20. arXiv:2504.11420  [pdf, other]

    cs.CL

    Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts

    Authors: Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, Sinno Jialin Pan

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this w…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 19 pages, 8 figures

  21. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  22. arXiv:2504.10685  [pdf, other]

    cs.CV cs.AI

    NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results

    Authors: Yuqian Fu, Xingyu Qiu, Bin Ren, Yanwei Fu, Radu Timofte, Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, Kaijin Zhang, Qingpeng Nong, Xiugang Dong, Hong Gao, Xiangsheng Zhou, Jiancheng Pan, Yanxing Liu, Xiao He, Jiahao Li, Yuze Sun, Xiaomeng Huang, Zhenyu Zhang, Ran Ma, Yuhan Liu, Zijian Zhuang, Shuai Yi, Yixiong Zou, et al. (37 additional authors not shown)

    Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registe…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: accepted by CVPRW 25 @ NTIRE

  23. arXiv:2504.10180  [pdf, other]

    cs.HC

    ChartOptimiser: Task-driven Optimisation of Chart Designs

    Authors: Yao Wang, Jiarong Pan, Danqing Shi, Zhiming Hu, Antti Oulasvirta, Andreas Bulling

    Abstract: Effective chart design is essential for satisfying viewers' information needs, such as retrieving values from a chart or comparing two values. However, creating effective charts is challenging and time-consuming due to the large design space and the inter-dependencies between individual design parameters. To address this challenge, we propose ChartOptimiser -- a Bayesian approach for task-driven o…

    Submitted 14 April, 2025; originally announced April 2025.

  24. arXiv:2504.09389  [pdf, other]

    cs.CL

    Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

    Authors: Vishakh Padmakumar, Chen Yueh-Han, Jane Pan, Valerie Chen, He He

    Abstract: As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preferen…

    Submitted 12 April, 2025; originally announced April 2025.

  25. arXiv:2504.08850  [pdf, other]

    cs.DC cs.AI

    SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting

    Authors: Jiaming Xu, Jiayi Pan, Yongkang Zhou, Siming Chen, Jinhao Li, Yaoxiu Lian, Junyi Wu, Guohao Dai

    Abstract: Early exiting has recently emerged as a promising technique for accelerating large language models (LLMs) by effectively reducing the hardware computation and memory access. In this paper, we present SpecEE, a fast LLM inference engine with speculative early exiting. (1) At the algorithm level, we propose the speculation-based lightweight predictor design by exploiting the probabilistic correlatio…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted by ISCA 2025

  26. arXiv:2504.05599  [pdf, other]

    cs.CV cs.CL

    Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

    Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text al…

    Submitted 7 April, 2025; originally announced April 2025.

  27. arXiv:2504.05419  [pdf, other]

    cs.AI cs.CL

    Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

    Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He

    Abstract: Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work…

    Submitted 7 April, 2025; originally announced April 2025.

  28. arXiv:2504.05050  [pdf, other]

    cs.CL cs.AI

    Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

    Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau

    Abstract: Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing…

    Submitted 17 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  29. arXiv:2504.04517  [pdf, other]

    cs.CV cs.AI

    Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection

    Authors: Jiancheng Pan, Yanxing Liu, Xiao He, Long Peng, Jiahao Li, Yuze Sun, Xiaomeng Huang

    Abstract: Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Build…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 9 pages, 6 figures

  30. arXiv:2504.01407  [pdf, other]

    cs.CV cs.AI

    TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

    Authors: Junwen Pan, Rui Zhang, Xin Wan, Yuan Zhang, Ming Lu, Qi She

    Abstract: Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hier…

    Submitted 2 April, 2025; originally announced April 2025.

  31. arXiv:2504.01260  [pdf, other]

    cs.RO cs.HC eess.SY

    The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction

    Authors: Roy El-Helou, Matthew K. X. J Pan

    Abstract: This study explores how human perceptions of a non-anthropomorphic robotic manipulator are shaped by two key dimensions of behaviour: arousal, defined as the robot's movement energy and expressiveness, and attention, defined as the robot's capacity to selectively orient toward and engage with a user. We introduce a novel control architecture that integrates a gaze-like attention engine with an aro…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: 7 pages, 3 figures, 1 table

  32. Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records

    Authors: Jie Pan, Seungwon Lee, Cheligeer Cheligeer, Elliot A. Martin, Kiarash Riazi, Hude Quan, Na Li

    Abstract: Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR cl…

    Submitted 31 March, 2025; originally announced April 2025.

  33. arXiv:2503.24180  [pdf, other]

    cs.CV cs.HC

    Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up

    Authors: Ziming Cheng, Zhiyuan Huang, Junting Pan, Zhaohui Hou, Mingjie Zhan

    Abstract: Graphical user interfaces (GUI) automation agents are emerging as powerful tools, enabling humans to accomplish increasingly complex tasks on smart devices. However, users often inadvertently omit key information when conveying tasks, which hinders agent performance in the current agent paradigm that does not support immediate user intervention. To address this issue, we introduce a…

    Submitted 31 March, 2025; originally announced March 2025.

  34. arXiv:2503.20174  [pdf, other]

    cs.CV

    Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

    Authors: Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang

    Abstract: Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently from uniform split subspaces, and a redundancy issue is triggered to hinder the model from achieving satisfact…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 11 pages, 10 figures

  35. arXiv:2503.19470  [pdf, other]

    cs.AI cs.CL

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Authors: Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, Weipeng Chen

    Abstract: Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learnin…

    Submitted 27 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Work in progress

  36. arXiv:2503.17616  [pdf, other]

    cs.CE

    Generalized Scattering Matrix Synthesis for Hybrid Systems with Multiple Scatterers and Antennas Using Independent Structure Simulations

    Authors: Chenbo Shi, Shichen Liang, Jin Pan, Xin Gu, Le Zuo

    Abstract: This paper presents a unified formulation for calculating the generalized scattering matrix (GS-matrix) of hybrid systems involving multiple scatterers and antennas. The GS-matrix of the entire system is synthesized through the scattering matrices and GS-matrices of each independent component, using the addition theorem of vector spherical wavefunctions and fully matrix-based operations. Since our…

    Submitted 21 March, 2025; originally announced March 2025.

  37. arXiv:2503.14037  [pdf, other]

    cs.CV

    Intra and Inter Parser-Prompted Transformers for Effective Image Restoration

    Authors: Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang

    Abstract: We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: This version is accepted by the Association for the Advancement of Artificial Intelligence (AAAI-25)

  38. arXiv:2503.11091  [pdf, other]

    cs.CV

    Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction

    Authors: Ganlong Zhao, Guanbin Li, Jia Pan, Yizhou Yu

    Abstract: Aerial Vision-and-Language Navigation (Aerial VLN) aims to enable an unmanned aerial vehicle agent to navigate aerial 3D environments following human instruction. Compared to ground-based VLN, aerial VLN requires the agent to decide the next action in both horizontal and vertical directions based on the first-person view observations. Previous methods struggle to perform well due to the longer nav…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Under Submission

  39. arXiv:2503.09218  [pdf, other]

    cs.CL

    N2C2: Nearest Neighbor Enhanced Confidence Calibration for Cross-Lingual In-Context Learning

    Authors: Jie He, Simon Yu, Deyi Xiong, Víctor Gutiérrez-Basulto, Jeff Z. Pan

    Abstract: Recent advancements of in-context learning (ICL) show language models can significantly improve their performance when demonstrations are provided. However, little attention has been paid to model calibration and prediction confidence of ICL in cross-lingual scenarios. To bridge this gap, we conduct a thorough analysis of ICL for cross-lingual sentiment classification. Our findings suggest that IC…

    Submitted 12 March, 2025; originally announced March 2025.

  40. arXiv:2503.09160  [pdf, other]

    cs.CV

    WonderVerse: Extendable 3D Scene Generation with Video Generative Models

    Authors: Hao Feng, Zhi Zuo, Jia-Hui Pan, Ka-Hei Hui, Yihua Shao, Qi Dou, Wei Xie, Zhengzhe Liu

    Abstract: We introduce WonderVerse, a simple but effective framework for generating extendable 3D scenes. Unlike existing methods that rely on iterative depth estimation and image inpainting, often leading to geometric distortions and inconsistencies, WonderVerse leverages the powerful world-level priors embedded within video generative foundation models to create highly immersive and geometrically…

    Submitted 14 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  41. arXiv:2503.09066  [pdf, other]

    cs.LG cs.AI cs.CR

    Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States

    Authors: Xin Wei Chia, Jonathan Pan

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden acti…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 4 figures

  42. arXiv:2503.08638  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    YuE: Scaling Open Foundation Models for Long-Form Music Generation

    Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, et al. (32 additional authors not shown)

    Abstract: We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: https://github.com/multimodal-art-projection/YuE

  43. arXiv:2503.08189  [pdf, other]

    cs.IR

    SoTCKGE: Continual Knowledge Graph Embedding Based on Spatial Offset Transformation

    Authors: Xinyan Wang, Jinshuo Liu, Cheng Bi, Kaijian Xie, Meng Wang, Juan Deng, Jeff Pan

    Abstract: Current Continual Knowledge Graph Embedding (CKGE) methods primarily rely on translation-based embedding methods, leveraging previously acquired knowledge to initialize new facts. To enhance learning efficiency, these methods often integrate fine-tuning or continual learning strategies. However, this compromises the model's prediction accuracy, and the translation-based methods lack support for com…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 9 pages, 5 figures

    MSC Class: 68T30; ACM Class: E.2

  44. arXiv:2503.05728  [pdf, other]

    cs.CY cs.AI

    Political Neutrality in AI is Impossible- But Here is How to Approximate it

    Authors: Jillian Fisher, Ruth E. Appel, Chan Young Park, Yujin Potter, Liwei Jiang, Taylor Sorensen, Shangbin Feng, Yulia Tsvetkov, Margaret E. Roberts, Jennifer Pan, Dawn Song, Yejin Choi

    Abstract: AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality, defined as the absence of bias, is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, an…

    Submitted 18 February, 2025; originally announced March 2025.

    Comments: Code: https://github.com/jfisher52/Approximation_Political_Neutrality

  45. arXiv:2503.05281  [pdf, other]

    cs.CL

    Similarity-Based Domain Adaptation with LLMs

    Authors: Jie He, Wendi Zhou, Xiang Lorraine Li, Jeff Z. Pan

    Abstract: Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data. Prior research has primarily focused on learning domain-invariant features across the source and target domains. However, these methods often require training a model using source domain data, which is time-consuming and can limit model usage for applications with di…

    Submitted 7 March, 2025; originally announced March 2025.

  46. arXiv:2503.03464  [pdf, other]

    cs.RO

    Generative Artificial Intelligence in Robotic Manipulation: A Survey

    Authors: Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen, Jia Pan, Wei Zhang, Bo Yang, Hua Chen

    Abstract: This survey provides a comprehensive review on recent advancements of generative learning models in robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks, including significant challenges in insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modality reasoning ability for robust policy le…

    Submitted 10 March, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

  47. arXiv:2503.03196  [pdf, other]

    cs.CV cs.HC cs.RO

    SpiritSight Agent: Advanced GUI Agent with One Look

    Authors: Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan

    Abstract: Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating human users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility for different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally m…

    Submitted 16 April, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

    Comments: Paper accepted to CVPR 2025

  48. arXiv:2503.02112  [pdf, other]

    cs.LG astro-ph.IM

    Building Machine Learning Challenges for Anomaly Detection in Science

    Authors: Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, et al. (125 additional authors not shown)

    Abstract: Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be c…

    Submitted 29 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: 17 pages, 6 figures; to be submitted to Nature Communications

  49. arXiv:2503.01743  [pdf, other]

    cs.CL cs.AI cs.LG

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Authors: Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, et al. (51 additional authors not shown)

    Abstract: We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement…

    Submitted 7 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: 39 pages

  50. arXiv:2503.01710  [pdf, other]

    cs.SD cs.AI eess.AS

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

    Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Submitted to ACL 2025