Showing 1–50 of 576 results for author: Wu, K

  1. arXiv:2507.04789  [pdf, ps, other]

    cs.RO

    Training-free Generation of Temporally Consistent Rewards from VLMs

    Authors: Yinuo Zhao, Jiale Yuan, Zhiyuan Xu, Xiaoshuai Hao, Xinyi Zhang, Kun Wu, Zhengping Che, Chi Harold Liu, Jian Tang

    Abstract: Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time app…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.04455  [pdf, ps, other]

    cs.CL

    GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

    Authors: Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu

    Abstract: The rapid growth of large language models (LLMs) with traditional centralized fine-tuning emerges as a key technique for adapting these models to domain-specific challenges, yielding privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), is proposed to address these challenges, where a weaker emulator is compressed from the original model and further fine…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: Accepted by ACL 2025 main

  3. arXiv:2507.04118  [pdf, ps, other]

    cs.CV

    PromptSR: Cascade Prompting for Lightweight Image Super-Resolution

    Authors: Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau

    Abstract: Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. T…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Accepted in TMM

  4. arXiv:2507.03243  [pdf, ps, other]

    cs.HC

    Beyond Charging Anxiety: An Explainable Approach to Understanding User Preferences of EV Charging Stations Using Review Data

    Authors: Zifei Wang, Emmanuel Abolarin, Kai Wu, Venkatarao Rebba, Jian Hu, Zhen Hu, Shan Bao, Feng Zhou

    Abstract: Electric vehicles (EVs) charging infrastructure is directly related to the overall EV user experience and thus impacts the widespread adoption of EVs. Understanding key factors that affect EV users' charging experience is essential for building a robust and user-friendly EV charging infrastructure. This study leverages about 17,000 charging station (CS) reviews on Google Maps to explore EV user…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 19 pages, 8 figures

  5. arXiv:2507.01192  [pdf, ps, other]

    cs.CC

    PCPP-Based Reconfiguration Inapproximability: Query Complexity vs. Soundness Gap Trade-offs

    Authors: Venkatesan Guruswami, Xuandi Ren, Kewen Wu

    Abstract: The Reconfiguration Inapproximability Hypothesis (RIH), recently established by Hirahara-Ohsaka (STOC'24) and Karthik-Manurangsi (ECCC'24), studies the hardness of reconfiguring one solution into another in constraint satisfaction problems (CSP) when restricted to approximate intermediate solutions. In this work, we make a tighter connection between RIH's soundness gap and that of probabilisticall…

    Submitted 1 July, 2025; originally announced July 2025.

  6. arXiv:2507.01059  [pdf, ps, other]

    cs.MA cs.AI cs.CL cs.CV cs.RO

    Automated Vehicles Should be Connected with Natural Language

    Authors: Xiangbo Gao, Keshu Wu, Hao Zhang, Kexin Tian, Yang Zhou, Zhengzhong Tu

    Abstract: Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media -- including raw sensor data, neural network features, and perception results -- suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have large…

    Submitted 29 June, 2025; originally announced July 2025.

  7. arXiv:2506.20535  [pdf, ps, other]

    cs.DC cs.AI cs.LG

    WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads

    Authors: Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang

    Abstract: The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presen…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 11 pages, 7 figures and 5 tables

  8. arXiv:2506.19283  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

    Authors: Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu

    Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative o…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  9. arXiv:2506.18512  [pdf, ps, other]

    eess.IV cs.CL cs.CV q-bio.QM

    MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis

    Authors: Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu

    Abstract: Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to…

    Submitted 23 June, 2025; originally announced June 2025.

  10. arXiv:2506.16160  [pdf, ps, other]

    cs.CV

    Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization

    Authors: Jiyao Wang, Xiao Yang, Hao Lu, Dengbo He, Kaishun Wu

    Abstract: Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and attracts increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, the test-time personal…

    Submitted 19 June, 2025; originally announced June 2025.

  11. arXiv:2506.10155  [pdf]

    cs.CL cs.AI cs.LG

    Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities

    Authors: Elizabeth Demers, Victor Xiaoqi Wang, Kean Wu

    Abstract: Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relati…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 50 pages, 6 figures, 5 tables

    Journal ref: Journal of Information Systems 38 (2024) 163-186

  12. arXiv:2506.09482  [pdf, ps, other]

    cs.CV

    Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

    Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao

    Abstract: We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation…

    Submitted 15 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  13. arXiv:2506.08822  [pdf, ps, other]

    cs.RO cs.AI

    FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency

    Authors: Yifei Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, Zhiyuan Xu, Zhengping Che, Jian Tang

    Abstract: Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation attributed to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. To address this issue, existing approaches accelerate the sampling process in generative modeling-based visuomotor policie…

    Submitted 10 June, 2025; originally announced June 2025.

  14. arXiv:2506.07179  [pdf, ps, other]

    cs.LG cs.AI

    Regularized Adaptive Graph Learning for Large-Scale Traffic Forecasting

    Authors: Kaiqi Wu, Weiyang Kong, Sen Zhang, Yubao Liu, Zitong Chen

    Abstract: Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. Adaptive graph convolution networks have emerged as mainstream solutions due to their ability to learn node embeddings in a data-driven manner and capture complex latent dependencies. However, existing adaptive graph learning methods for traffic forecasting often e…

    Submitted 8 June, 2025; originally announced June 2025.

  15. arXiv:2506.06644  [pdf, ps, other]

    cs.LG stat.ML

    Spark Transformer: Reactivating Sparsity in FFN and Attention

    Authors: Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah J. Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E. Culler, Henry M. Levy, Sanjiv Kumar

    Abstract: The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the Re…

    Submitted 6 June, 2025; originally announced June 2025.

  16. arXiv:2506.04941  [pdf, ps, other]

    cs.RO

    ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

    Authors: Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

    Abstract: Robot learning increasingly relies on simulation to advance complex ability such as dexterous manipulations and precise interactions, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models mas…

    Submitted 5 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  17. arXiv:2506.04562  [pdf, other]

    cs.GR cs.CV

    Handle-based Mesh Deformation Guided By Vision Language Model

    Authors: Xingpeng Sun, Shiyang Jia, Zherong Pan, Kui Wu, Aniket Bera

    Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interp…

    Submitted 4 June, 2025; originally announced June 2025.

  18. arXiv:2506.03574  [pdf, ps, other]

    cs.RO

    SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models

    Authors: Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, Jian Tang

    Abstract: Robots deployed in dynamic environments must be able to not only follow diverse language instructions but flexibly adapt when user intent changes mid-execution. While recent Vision-Language-Action (VLA) models have advanced multi-task learning and instruction following, they typically assume static task intent, failing to respond when new instructions arrive during ongoing execution. This limitati…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Website: https://switchvla.github.io

  19. arXiv:2506.02917  [pdf, other]

    cs.RO

    Text-guided Generation of Efficient Personalized Inspection Plans

    Authors: Xingpeng Sun, Zherong Pan, Xifeng Gao, Kui Wu, Aniket Bera

    Abstract: We propose a training-free, Vision-Language Model (VLM)-guided approach for efficiently generating trajectories to facilitate target inspection planning based on text descriptions. Unlike existing Vision-and-Language Navigation (VLN) methods designed for general agents in unknown environments, our approach specifically targets the efficient inspection of known scenes, with widespread applications…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 8 pages, 5 figures

  20. arXiv:2506.01376  [pdf, ps, other]

    cs.LG

    Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training

    Authors: Minghao Xu, Jiaze Song, Keming Wu, Xiangxin Zhou, Bin Cui, Wentao Zhang

    Abstract: Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while they neglected the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. We fill this blank b…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Published at ICML 2025. All code and data are released

  21. arXiv:2505.24476  [pdf, ps, other]

    cs.CV

    Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model

    Authors: Yuting Zhang, Hao Lu, Qingyong Hu, Yin Wang, Kaishen Yuan, Xin Liu, Kaishun Wu

    Abstract: Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, curren…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR 2025

  22. arXiv:2505.23189  [pdf, ps, other]

    cs.RO cs.CV

    TrackVLA: Embodied Visual Tracking in the Wild

    Authors: Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang

    Abstract: Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge thr…

    Submitted 29 May, 2025; originally announced May 2025.

  23. arXiv:2505.22787  [pdf, ps, other]

    cs.CL

    Can Large Language Models Match the Conclusions of Systematic Reviews?

    Authors: Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy

    Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs…

    Submitted 28 May, 2025; originally announced May 2025.

  24. arXiv:2505.22523  [pdf, ps, other]

    cs.CV

    PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

    Authors: Junwen Chen, Heyang Jiang, Yanbin Wang, Keming Wu, Ji Li, Chao Zhang, Keiji Yanai, Dong Chen, Yuhui Yuan

    Abstract: Generating high-quality, multi-layer transparent images from text prompts can unlock a new level of creative control, allowing users to edit each layer as effortlessly as editing text outputs from LLMs. However, the development of multi-layer generative models lags behind that of conventional text-to-image models due to the absence of a large, high-quality corpus of multi-layer transparent data. I…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Homepage: https://prism-layers.github.io/

  25. arXiv:2505.21743  [pdf, other]

    cs.LG cs.AI

    Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen

    Authors: Zihao Li, Xinyuan Cao, Xiangbo Gao, Kexin Tian, Keshu Wu, Mohammad Anis, Hao Zhang, Keke Long, Jiwan Jiang, Xiaopeng Li, Yunlong Zhang, Tianbao Yang, Dominique Lord, Zhengzhong Tu, Yang Zhou

    Abstract: Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outc…

    Submitted 27 May, 2025; originally announced May 2025.

  26. arXiv:2505.21349  [pdf, ps, other]

    cs.CE

    Out of the Past: An AI-Enabled Pipeline for Traffic Simulation from Noisy, Multimodal Detector Data and Stakeholder Feedback

    Authors: Rex Chen, Karen Wu, John McCartney, Norman Sadeh, Fei Fang

    Abstract: How can a traffic simulation be designed to faithfully reflect real-world traffic conditions? Past data-driven approaches to traffic simulation in the literature have relied on unrealistic or suboptimal heuristics. They also fail to adequately account for the effects of uncertainty and multimodality in the data on simulation outcomes. In this work, we integrate advances in AI to construct a three-…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 12 pages; 5 figures; preprint version

  27. arXiv:2505.21220  [pdf, ps, other]

    astro-ph.CO cs.LG

    Wavelet Flow For Extragalactic Foreground Simulations

    Authors: M. Mebratu, W. L. K. Wu

    Abstract: Extragalactic foregrounds in cosmic microwave background (CMB) observations are both a source of cosmological and astrophysical information and a nuisance to the CMB. Effective field-level modeling that captures their non-Gaussian statistical distributions is increasingly important for optimal information extraction, particularly given the precise and low-noise observations from current and upcomi…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 19 pages, 7 figures

  28. arXiv:2505.20718  [pdf, ps, other]

    cs.CV cs.AI

    VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models

    Authors: Kui Wu, Shuhang Xu, Hao Chen, Churan Wang, Zhoujun Li, Yizhou Wang, Fangwei Zhong

    Abstract: We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines the off-the-shelf active tracking methods with VLMs' reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM…

    Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  29. arXiv:2505.20710  [pdf, ps, other]

    cs.CV

    Hierarchical Instruction-aware Embodied Visual Tracking

    Authors: Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong

    Abstract: User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or g…

    Submitted 27 May, 2025; originally announced May 2025.

  30. arXiv:2505.19611  [pdf, other]

    cs.CV cs.AI

    Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

    Authors: Ruolin Shen, Xiaozhong Ji, Kai WU, Jiangning Zhang, Yijun He, HaiHua Yang, Xiaobin Hu, Xiaoyu Sun

    Abstract: Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Project Website: https://github.com/HUuxiaobin/VRRF

  31. arXiv:2505.18805  [pdf, ps, other]

    cs.GR

    DiffHairCard: Auto Hair Card Extraction with Differentiable Rendering

    Authors: Zhongtian Zheng, Tao Huang, Haozhe Su, Xueqi Ma, Yuefan Shen, Tongtong Wang, Yin Yang, Xifeng Gao, Zherong Pan, Kui Wu

    Abstract: Hair cards remain a widely used representation for hair modeling in real-time applications, offering a practical trade-off between visual fidelity, memory usage, and performance. However, generating high-quality hair card models remains a challenging and labor-intensive task. This work presents an automated pipeline for converting strand-based hair models into hair card models with a limited numbe…

    Submitted 24 May, 2025; originally announced May 2025.

  32. arXiv:2505.15431  [pdf, ps, other]

    cs.CL

    Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

    Authors: Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, et al. (230 additional authors not shown)

    Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid response…

    Submitted 4 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  33. arXiv:2505.12884  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

    Authors: Yuanze Hu, Zhaoxin Fan, Xinyu Wang, Gen Li, Ye Qiu, Zhichao Yang, Wenjun Wu, Kejian Wu, Yifan Sun, Xiaotie Deng, Jin Dong

    Abstract: Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight m…

    Submitted 30 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  34. arXiv:2505.12285  [pdf, ps, other]

    cs.NE

    CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design

    Authors: Ziyao Huang, Weiwei Wu, Kui Wu, Jianping Wang, Wei-Bin Lee

    Abstract: Tackling complex optimization problems often relies on expert-designed heuristics, typically crafted through extensive trial and error. Recent advances demonstrate that large language models (LLMs), when integrated into well-designed evolutionary search frameworks, can autonomously discover high-performing heuristics at a fraction of the traditional cost. However, existing approaches predominantly…

    Submitted 18 May, 2025; originally announced May 2025.

  35. arXiv:2505.11733  [pdf, ps, other]

    cs.CL

    MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

    Authors: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou

    Abstract: Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final ans…

    Submitted 20 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  36. arXiv:2505.11462  [pdf, ps, other]

    cs.CL cs.AI

    Disentangling Reasoning and Knowledge in Medical Large Language Models

    Authors: Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

    Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human perfor…

    Submitted 23 June, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  37. arXiv:2505.10560  [pdf, other]

    cs.DB cs.NI

    Approximation-First Timeseries Monitoring Query At Scale

    Authors: Zeying Zhu, Jonathan Chamberlain, Kenny Wu, David Starobinski, Zaoxing Liu

    Abstract: Timeseries monitoring systems such as Prometheus play a crucial role in gaining observability of the underlying system components. These systems collect timeseries metrics from various system components and perform monitoring queries over periodic window-based aggregations (i.e., rule queries). However, despite wide adoption, the operational costs and query latency of rule queries remain high. In…

    Submitted 15 May, 2025; originally announced May 2025.

  38. arXiv:2505.08854  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    Generative AI for Autonomous Driving: Frontiers and Opportunities

    Authors: Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, et al. (22 additional authors not shown)

    Abstract: Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, partic…

    Submitted 13 May, 2025; originally announced May 2025.

  39. arXiv:2505.07687  [pdf, ps, other]

    eess.IV cs.CV

    ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation

    Authors: Feng Yuan, Yifan Gao, Wenbin Wu, Keqing Wu, Xiaotong Guo, Jie Jiang, Xin Gao

    Abstract: Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for pr…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: MICCAI 2025 (under review)

  40. arXiv:2505.06166  [pdf, other]

    cs.CV

    DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

    Authors: Radu Alexandru Rosu, Keyu Wu, Yao Feng, Youyi Zheng, Michael J. Black

    Abstract: We address the task of generating 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted to CVPR 2025

  41. arXiv:2505.00322  [pdf, other]

    cs.RO cs.AI

    AI2-Active Safety: AI-enabled Interaction-aware Active Safety Analysis with Vehicle Dynamics

    Authors: Keshu Wu, Zihao Li, Sixu Li, Xinyue Ye, Dominique Lord, Yang Zhou

    Abstract: This paper introduces an AI-enabled, interaction-aware active safety analysis framework that accounts for groupwise vehicle interactions. Specifically, the framework employs a bicycle model, augmented with road gradient considerations, to accurately capture vehicle dynamics. In parallel, a hypergraph-based AI model is developed to predict probabilistic trajectories of ambient traffic. By integrating…

    Submitted 1 May, 2025; originally announced May 2025.

  42. arXiv:2504.18154  [pdf, other]

    cs.DC

    EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration

    Authors: Jiangsu Du, Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Kaiyi Wu, Zhiguang Chen, Yutong Lu

    Abstract: Existing LLM serving strategies can be categorized based on whether prefill and decode phases are disaggregated: non-disaggregated (NoDG) or fully disaggregated (FuDG). However, the NoDG strategy leads to strong prefill-decode interference and the FuDG strategy highly relies on high-performance interconnects, making them less cost-effective. We introduce EcoServe, a system that enables cost-effe… ▽ More

    Submitted 25 April, 2025; originally announced April 2025.

  43. arXiv:2504.17968  [pdf, other

    cs.RO

    Virtual Roads, Smarter Safety: A Digital Twin Framework for Mixed Autonomous Traffic Safety Analysis

    Authors: Hao Zhang, Ximin Yue, Kexin Tian, Sixu Li, Keshu Wu, Zihao Li, Dominique Lord, Yang Zhou

    Abstract: This paper presents a digital-twin platform for active safety analysis in mixed traffic environments. The platform is built using a multi-modal data-enabled traffic environment constructed from drone-based aerial LiDAR, OpenStreetMap, and vehicle sensor data (e.g., GPS and inclinometer readings). High-resolution 3D road geometries are generated through AI-powered semantic segmentation and georefer… ▽ More

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 14 pages, 18 figures

  44. arXiv:2504.17457  [pdf, other

    cs.CV

    Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

    Authors: Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong

    Abstract: Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack… ▽ More

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 14 pages, 7 figures

  45. arXiv:2504.16474  [pdf, ps, other

    cs.CR cs.LG

    Seeking Flat Minima over Diverse Surrogates for Improved Adversarial Transferability: A Theoretical Framework and Algorithmic Instantiation

    Authors: Meixi Zheng, Kehan Wu, Yanbo Fan, Rui Huang, Baoyuan Wu

    Abstract: The transfer-based black-box adversarial attack setting poses the challenge of crafting an adversarial example (AE) on known surrogate models that remain effective against unseen target models. Due to the practical importance of this task, numerous methods have been proposed to address this challenge. However, most previous methods are heuristically designed and intuitively justified, lacking a th… ▽ More

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 26 pages, 6 figures

  46. arXiv:2504.15264  [pdf, ps, other

    math.CO cs.DM quant-ph

    Sunflowers and Ramsey problems for restricted intersections

    Authors: Barnabás Janzer, Zhihan Jin, Benny Sudakov, Kewen Wu

    Abstract: Extremal problems on set systems with restricted intersections have been an important part of combinatorics in the last 70 years. In this paper, we study the following Ramsey version of these problems. Given a set $L\subseteq \{0,\dots,k-1\}$ and a family $\mathcal{F}$ of $k$-element sets which does not contain a sunflower with $m$ petals whose kernel size is in $L$, how large a subfamily of… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 23 pages + 7-page appendix

  47. arXiv:2504.14841  [pdf, other

    quant-ph cs.CC math.OC

    (Sub)Exponential Quantum Speedup for Optimization

    Authors: Jiaqi Leng, Kewen Wu, Xiaodi Wu, Yufan Zheng

    Abstract: We demonstrate provable (sub)exponential quantum speedups in both discrete and continuous optimization, achieved through simple and natural quantum optimization algorithms, namely the quantum adiabatic algorithm for discrete optimization and quantum Hamiltonian descent for continuous optimization. Our result builds on the Gilyén–Hastings–Vazirani (sub)exponential oracle separation for adiabatic… ▽ More

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: 69 pages, 6 figures

  48. arXiv:2504.14535  [pdf, other

    cs.CV

    FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

    Authors: Kuanting Wu, Kei Ota, Asako Kanezaki

    Abstract: Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted fro… ▽ More

    Submitted 20 April, 2025; originally announced April 2025.

  49. arXiv:2504.11919  [pdf, other

    cs.AI

    Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading

    Authors: Qianjin Yu, Keyu Wu, Zihan Chen, Chushu Zhang, Manlin Mei, Lingjun Huang, Fang Tan, Yongsheng Du, Kunlin Liu, Yurui Zhu

    Abstract: Recently, DeepSeek-R1 (671B) (DeepSeek-AI et al., 2025) has demonstrated its excellent reasoning ability in complex tasks and has publicly shared its methodology. This provides potentially high-quality chain-of-thought (CoT) data for stimulating the reasoning abilities of small-sized large language models (LLMs). To generate high-quality CoT data for different LLMs, we seek an efficient method for g… ▽ More

    Submitted 16 April, 2025; originally announced April 2025.

  50. arXiv:2504.10373  [pdf, other

    cs.LG math.DS math.NA stat.ML

    DUE: A Deep Learning Framework and Library for Modeling Unknown Equations

    Authors: Junfeng Chen, Kailiang Wu, Dongbin Xiu

    Abstract: Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown e… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 28 pages