Showing 1–50 of 777 results for author: Pan, J

  1. arXiv:2507.05397  [pdf, ps, other]

    cs.CV

    Neural-Driven Image Editing

    Authors: Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You

    Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffu…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 22 pages, 14 figures

  2. arXiv:2507.03226  [pdf]

    cs.AI

    Efficient Knowledge Graph Construction and Retrieval from Unstructured Text for Large-Scale RAG Systems

    Authors: Congmin Min, Rhea Mathew, Joyce Pan, Sahil Bansal, Abbas Keshavarzi, Amar Viswanathan Kannan

    Abstract: We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and the latency of graph-based retrieval.…

    Submitted 3 July, 2025; originally announced July 2025.

  3. arXiv:2507.02128  [pdf, ps, other]

    cs.LG

    CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs

    Authors: Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen

    Abstract: Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection r…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCAD 2025

  4. arXiv:2507.01275  [pdf, ps, other]

    cs.CV

    Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

    Authors: Chengxu Liu, Lu Qi, Jinshan Pan, Xueming Qian, Ming-Hsuan Yang

    Abstract: Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propo…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  5. arXiv:2507.00917  [pdf, ps, other]

    cs.RO

    A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

    Authors: Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai

    Abstract: The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interacti…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey

  6. arXiv:2507.00429  [pdf, ps, other]

    cs.CV

    DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting

    Authors: Jingyi Pan, Dan Xu, Qiong Luo

    Abstract: Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view. 2) Appearance inconsi…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: ICCV 2025, Project page: https://rorisis.github.io/DiGA3D/

  7. arXiv:2506.23584  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

    Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Russell Terry, Jie Xu

    Abstract: Generating radiology reports from CT scans remains a complex task due to the nuanced nature of medical imaging and the variability in clinical documentation. In this study, we propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes such as locati…

    Submitted 30 June, 2025; originally announced June 2025.

  8. arXiv:2506.22490  [pdf, ps, other]

    eess.SP cs.LG

    MENGLAN: Multiscale Enhanced Nonparametric Gas Analyzer with Lightweight Architecture and Networks

    Authors: Zhenke Duan, Jiqun Pan, Jiani Tu

    Abstract: Accurate detection of ethylene concentrations in mixed gases is crucial in chemical production for safety and health purposes. Traditional methods are hindered by high cost and complexity, limiting their practical application. This study proposes MENGLAN, a Multiscale Enhanced Nonparametric Gas Analyzer that integrates a dual-stream structure, a Hybrid Multi-Head Attention mechanism, and a Feature…

    Submitted 24 June, 2025; originally announced June 2025.

  9. arXiv:2506.21901  [pdf, ps, other]

    cs.DB

    A Survey of LLM Inference Systems

    Authors: James Pan, Guoliang Li

    Abstract: The past few years have witnessed specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the unique autoregressive nature of LLM request processing, motivating new techniques for achieving high performance while preserving high inference quality over high-v…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 25 pages

  10. arXiv:2506.19599  [pdf, ps, other]

    cs.CL cs.AI

    ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model

    Authors: Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang

    Abstract: In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chain…

    Submitted 24 June, 2025; originally announced June 2025.

  11. arXiv:2506.18825  [pdf, ps, other]

    cs.RO

    SViP: Sequencing Bimanual Visuomotor Policies with Object-Centric Motion Primitives

    Authors: Yizhou Chen, Hang Xu, Dongjie Yu, Zeqing Zhang, Yi Ren, Jia Pan

    Abstract: Imitation learning (IL), particularly when leveraging high-dimensional visual inputs for policy training, has proven intuitive and effective in complex bimanual manipulation tasks. Nonetheless, the generalization capability of visuomotor policies remains limited, especially when only small demonstration datasets are available. Accumulated errors in visuomotor policies significantly hinder their ability…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project website: https://sites.google.com/view/svip-bimanual

  12. arXiv:2506.18586  [pdf]

    cs.AI cs.CE cs.CL

    Airalogy: AI-empowered universal data digitization for research automation

    Authors: Zijie Yang, Qiji Zhou, Fang Guo, Sijie Zhang, Yexun Xi, Jinglei Nie, Yudian Zhu, Liping Huang, Chou Wu, Yonghe Xia, Xiaoyu Ma, Yingming Pu, Panzhong Lu, Junshu Pan, Mingtao Chen, Tiannan Guo, Yanmei Dou, Hongyu Chen, Anping Zeng, Jiaxing Huang, Tian Xu, Yue Zhang

    Abstract: Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, a…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 146 pages, 6 figures, 49 supplementary figures

  13. arXiv:2506.16112  [pdf, ps, other]

    cs.CV

    AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

    Authors: Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

    Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 19 pages

  14. arXiv:2506.13651  [pdf, ps, other]

    cs.LG

    xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

    Authors: Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, et al. (8 additional authors not shown)

    Abstract: We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity. While existing benchmarks often focus on isolated technical skills, they may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Project page: https://xbench.org

  15. arXiv:2506.12754  [pdf, ps, other]

    cs.LG cs.AI

    AFBS: Buffer Gradient Selection in Semi-asynchronous Federated Learning

    Authors: Chaoyi Lu, Yiding Sun, Jinqian Chen, Zhichuan Yang, Jiangming Pan, Jihua Zhu

    Abstract: Asynchronous federated learning (AFL) accelerates training by eliminating the need to wait for stragglers, but its asynchronous nature introduces gradient staleness, where outdated gradients degrade performance. Existing solutions address this issue with gradient buffers, forming a semi-asynchronous framework. However, this approach struggles when buffers accumulate numerous stale gradients, as bl…

    Submitted 23 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

  16. arXiv:2506.12600  [pdf]

    cs.MA cs.AI cs.ET cs.GT cs.RO

    Trust-MARL: Trust-Based Multi-Agent Reinforcement Learning Framework for Cooperative On-Ramp Merging Control in Heterogeneous Traffic Flow

    Authors: Jie Pan, Tianyi Wang, Christian Claudel, Jing Shi

    Abstract: Intelligent transportation systems require connected and automated vehicles (CAVs) to conduct safe and efficient cooperation with human-driven vehicles (HVs) in complex real-world traffic environments. However, the inherent unpredictability of human behaviour, especially at bottlenecks such as highway on-ramp merging areas, often disrupts traffic flow and compromises system performance. To address…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 34 pages, 7 figures, 4 tables

  17. arXiv:2506.11820  [pdf, ps, other]

    cs.CV cs.CL

    Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation

    Authors: Xintong Wang, Jingheng Pan, Yixiao Liu, Xiaohu Zhao, Chenyang Lyu, Minghao Wu, Chris Biemann, Longyue Wang, Linlong Xu, Weihua Luo, Kaifu Zhang

    Abstract: Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of t…

    Submitted 13 June, 2025; originally announced June 2025.

  18. arXiv:2506.10967  [pdf, ps, other]

    cs.CV cs.AI

    Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

    Authors: Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang

    Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning,…

    Submitted 1 July, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 22 pages, 5 figures, code: https://github.com/Theia-4869/CDPruner, project page: https://theia-4869.github.io/CDPruner

  19. arXiv:2506.10797  [pdf]

    physics.med-ph cs.CV

    Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation

    Authors: Nicholas Summerfield, Qisheng He, Alex Kuo, Ahmed I. Ghanem, Simeng Zhu, Chase Ruff, Joshua Pan, Anudeep Kumar, Prashant Nagpal, Jiwei Zhao, Ming Dong, Carri K. Glide-Hurst

    Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardia…

    Submitted 12 June, 2025; originally announced June 2025.

  20. arXiv:2506.06367  [pdf, ps, other]

    cs.AI

    Towards Foundation Model on Temporal Knowledge Graph Reasoning

    Authors: Jiaxin Pan, Mojtaba Nayyeri, Osama Mohammed, Daniel Hernandez, Rongchuan Zhang, Cheng Cheng, Steffen Staab

    Abstract: Temporal Knowledge Graphs (TKGs) store temporal facts with quadruple formats (s, p, o, t). Existing Temporal Knowledge Graph Embedding (TKGE) models perform link prediction tasks in transductive or semi-inductive settings, which means the entities, relations, and temporal information in the test graph are fully or partially observed during training. Such reliance on seen elements during inference…

    Submitted 4 June, 2025; originally announced June 2025.

  21. arXiv:2506.05334  [pdf, ps, other]

    cs.CL cs.IR cs.LG

    Search Arena: Analyzing Search-Augmented LLMs

    Authors: Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

    Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preferen…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Preprint. Code: https://github.com/lmarena/search-arena. Dataset: https://huggingface.co/datasets/lmarena-ai/search-arena-24k

  22. arXiv:2506.02921  [pdf, ps, other]

    cs.CL

    A Controllable Examination for Long-Context Language Models

    Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov

    Abstract: Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NI…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Preprint

  23. arXiv:2506.02009  [pdf, ps, other]

    cs.DC

    STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

    Authors: Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, Tianyin Xu

    Abstract: In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper…

    Submitted 27 May, 2025; originally announced June 2025.

    Comments: 10 pages for main text and 40 pages in total

  24. arXiv:2506.00862  [pdf, ps, other]

    cs.LG

    FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling

    Authors: Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, Tailin Wu

    Abstract: Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirica…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 27 pages, 14 figures

  25. arXiv:2505.23540  [pdf, other]

    cs.CL

    Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

    Authors: Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li

    Abstract: Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, w…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 14 pages, to be published in ACL 2025 findings

  26. arXiv:2505.23290  [pdf, other]

    cs.SD cs.CV eess.AS

    Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

    Authors: Hao Li, Ju Dai, Xin Zhao, Feng Zhou, Junjun Pan, Lei Li

    Abstract: In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip mo…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted to CVPR 2025

  27. arXiv:2505.22705  [pdf, ps, other]

    cs.CV cs.MM

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Authors: Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei

    Abstract: Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is co…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Source codes and models are available at https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1

  28. arXiv:2505.22407  [pdf, ps, other]

    cs.CV

    Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation

    Authors: Jiadong Pan, Zhiyuan Ma, Kaiyan Zhang, Ning Ding, Bowen Zhou

    Abstract: Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still significantly suffer from the dilemma of image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for dif…

    Submitted 28 May, 2025; originally announced May 2025.

  29. arXiv:2505.20611  [pdf, ps, other]

    cs.CV

    Mamba-Driven Topology Fusion for Monocular 3-D Human Pose Estimation

    Authors: Zenghao Zheng, Lianping Yang, Jinshan Pan, Hegui Zhu

    Abstract: Transformer-based methods for 3-D human pose estimation face significant computational challenges due to the quadratic growth of self-attention mechanism complexity with sequence length. Recently, the Mamba model has substantially reduced computational overhead and demonstrated outstanding performance in modeling long sequences by leveraging state space model (SSM). However, the ability of SSM to…

    Submitted 26 May, 2025; originally announced May 2025.

  30. arXiv:2505.20231  [pdf, ps, other]

    cs.CL

    Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

    Authors: Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong

    Abstract: Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce an MS-TOD dataset, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for eval…

    Submitted 26 May, 2025; originally announced May 2025.

  31. arXiv:2505.19958  [pdf, ps, other]

    cs.CV

    UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space

    Authors: Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang

    Abstract: Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporal-coherent VSR through an efficient one-step diffusion space. A central co…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Under review, 10 pages, 7 figures

  32. arXiv:2505.17952  [pdf, ps, other]

    cs.CL cs.AI

    Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

    Authors: Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

    Abstract: Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to s…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Under Review

  33. arXiv:2505.17925  [pdf, other]

    cs.IR

    Enhancing CTR Prediction with De-correlated Expert Networks

    Authors: Jiancheng Wang, Mingjia Yin, Junwei Pan, Ximei Wang, Hao Wang, Enhong Chen

    Abstract: Modeling feature interactions is essential for accurate click-through rate (CTR) prediction in advertising systems. Recent studies have adopted the Mixture-of-Experts (MoE) approach to improve performance by ensembling multiple feature interaction experts. These studies employ various strategies, such as learning independent embedding tables for each expert or utilizing heterogeneous expert archit…

    Submitted 23 May, 2025; originally announced May 2025.

  34. arXiv:2505.17639  [pdf, other]

    cs.LG

    PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

    Authors: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

    Abstract: Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in…

    Submitted 23 May, 2025; originally announced May 2025.

  35. arXiv:2505.17528  [pdf, ps, other]

    eess.IV cs.CV

    DECT-based Space-Squeeze Method for Multi-Class Classification of Metastatic Lymph Nodes in Breast Cancer

    Authors: Hai Jiang, Chushan Zheng, Jiawei Pan, Yuanpin Zhou, Qiongting Liu, Xiang Zhang, Jun Shen, Yao Lu

    Abstract: Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class…

    Submitted 23 May, 2025; originally announced May 2025.

  36. arXiv:2505.17484  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Anatomy-Guided Multitask Learning for MRI-Based Classification of Placenta Accreta Spectrum and its Subtypes

    Authors: Hai Jiang, Qiongting Liu, Yuanpin Zhou, Jiawei Pan, Ting Song, Yao Lu

    Abstract: Placenta Accreta Spectrum Disorders (PAS) pose significant risks during pregnancy, frequently leading to postpartum hemorrhage during cesarean deliveries and other severe clinical complications, with bleeding severity correlating to the degree of placental invasion. Consequently, accurate prenatal diagnosis of PAS and its subtypes-placenta accreta (PA), placenta increta (PI), and placenta percreta…

    Submitted 23 May, 2025; originally announced May 2025.

  37. arXiv:2505.16161  [pdf, other]

    cs.CV

    Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey

    Authors: Liyan Wang, Weixiang Zhou, Cong Wang, Kin-Man Lam, Zhixun Su, Jinshan Pan

    Abstract: Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we system…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 20 papers, 12 figures

  38. arXiv:2505.15792  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Long-Form Information Alignment Evaluation Beyond Atomic Facts

    Authors: Danna Zheng, Mirella Lapata, Jeff Z. Pan

    Abstract: Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives…

    Submitted 21 May, 2025; originally announced May 2025.

  39. arXiv:2505.15297  [pdf, ps, other]

    cs.CL

    Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

    Authors: Xintong Wang, Yixiao Liu, Jingheng Pan, Liang Ding, Longyue Wang, Chris Biemann

    Abstract: Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxic…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 14 pages, 7 figures

  40. arXiv:2505.15141  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

    Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang

    Abstract: Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 35 pages, 4 figures

  41. arXiv:2505.15111  [pdf, ps, other]

    cs.CV cs.AI

    iPad: Iterative Proposal-centric End-to-End Autonomous Driving

    Authors: Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, Chen Lv

    Abstract: End-to-end (E2E) autonomous driving systems offer a promising alternative to traditional modular pipelines by reducing information loss and error accumulation, with significant potential to enhance both mobility and safety. However, most existing E2E approaches directly generate plans based on dense bird's-eye view (BEV) grid features, leading to inefficiency and limited planning awareness. To add…

    Submitted 21 May, 2025; originally announced May 2025.

  42. arXiv:2505.14116  [pdf, ps, other]

    cs.CL

    Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

    Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong

    Abstract: Inference-time scaling has attracted much attention, as it significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we i…

    Submitted 20 May, 2025; originally announced May 2025.

  43. arXiv:2505.13339  [pdf, other]

    cs.RO cs.AI

    OPA-Pack: Object-Property-Aware Robotic Bin Packing

    Authors: Jia-Hui Pan, Yeok Tatt Cheah, Zhengzhe Liu, Ka-Hei Hui, Xiaojie Gao, Pheng-Ann Heng, Yun-Hui Liu, Chi-Wing Fu

    Abstract: Robotic bin packing aids in a wide range of real-world scenarios such as e-commerce and warehouses. Yet, existing works focus mainly on considering the shape of objects to optimize packing compactness and neglect object properties such as fragility, edibility, and chemistry that humans typically consider when packing objects. This paper presents OPA-Pack (Object-Property-Aware Packing framework),…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE Transactions on Robotics (TRO) on Feb. 10, 2025

  44. arXiv:2505.13328  [pdf, other]

    cs.CL

    Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

    Authors: Hongru Wang, Wenyu Huang, Yufei Wang, Yuanhao Xi, Jianqiao Lu, Huan Zhang, Nan Hu, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong

    Abstract: Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool i…

    Submitted 19 May, 2025; originally announced May 2025.

  45. arXiv:2505.12744  [pdf, ps, other]

    cs.AI

    Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation

    Authors: Weiliang Tang, Dong Jing, Jia-Hui Pan, Zhiwu Lu, Yun-Hui Liu, Li Erran Li, Mingyu Ding, Chi-Wing Fu

    Abstract: Recent Large Multimodal Models have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. Howe…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 17 pages, 16 figures

  46. arXiv:2505.12108  [pdf, ps, other]

    cs.CV cs.AI

    EarthSynth: Generating Informative Earth Observation with Diffusion Models

    Authors: Jiancheng Pan, Shiye Lei, Yuqian Fu, Jiahao Li, Yanxing Liu, Yuze Sun, Xiao He, Long Peng, Xiaomeng Huang, Bo Zhao

    Abstract: Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of o…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: 23 pages

  47. arXiv:2505.11754  [pdf, ps, other]

    cs.CL

    Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

    Authors: Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan

    Abstract: Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask c…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: ACL 2025 main

  48. arXiv:2505.10732  [pdf]

    cs.CR cs.AI

    Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment

    Authors: Jia Hui Chin, Pu Zhang, Yu Xin Cheong, Jonathan Pan

    Abstract: In the current rapidly changing digital environment, businesses are under constant stress to ensure that their systems are secured. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, and gaps are identified for cybersecurity risk mitigation. However, audits are usually manual, requiring much time and cost. This paper looks at…

    Submitted 15 May, 2025; originally announced May 2025.

  49. arXiv:2505.10557  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

    Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li

    Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inhe…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 Findings

  50. arXiv:2505.10415  [pdf, ps, other]

    cs.RO cs.HC

    Internal State Estimation in Groups via Active Information Gathering

    Authors: Xuebo Ji, Zherong Pan, Xifeng Gao, Lei Yang, Xinxin Du, Kaiyun Li, Yongjin Liu, Wenping Wang, Changhe Tu, Jia Pan

    Abstract: Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in com…

    Submitted 15 May, 2025; originally announced May 2025.