Showing 1–50 of 1,228 results for author: Pu

Searching in archive cs.
  1. arXiv:2506.15841  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

    Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on ou…

    Submitted 18 June, 2025; originally announced June 2025.

    Report number: Revised-June18-2025

  2. "How can we learn and use AI at the same time?": Participatory Design of GenAI with High School Students

    Authors: Isabella Pu, Prerna Ravi, Linh Dieu Dinh, Chelsea Joe, Caitlin Ogoe, Zixuan Li, Cynthia Breazeal, Anastasia K. Ostrowski

    Abstract: As generative AI (GenAI) emerges as a transformative force, clear understanding of high school students' perspectives is essential for GenAI's meaningful integration in high school environments. In this work, we draw insights from a participatory design workshop where we engaged 17 high school students -- a group rarely involved in prior research in this area -- through the design of novel GenAI t…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Copyright protected by ACM, 17 pages, 5 figures, 2 tables, in proceedings of 24th annual ACM Interaction Design and Children Conference (IDC 2025)

  3. arXiv:2506.13415  [pdf, other]

    eess.IV cs.AI cs.CV

    Simple is what you need for efficient and accurate medical image segmentation

    Authors: Xiang Yu, Yayan Chen, Guannan He, Qing Zeng, Yue Qin, Meiling Liang, Dandan Luo, Yimei Liao, Zeyu Ren, Cheng Kang, Delong Yang, Bocheng Liang, Bin Pu, Ying Yuan, Shengli Li

    Abstract: While modern segmentation models often prioritize performance over practicality, we advocate a design philosophy prioritizing simplicity and efficiency, and attempt to design a high-performance segmentation model. This paper presents SimpleUNet, a scalable ultra-lightweight medical image segmentation model with three key innovations: (1) A partial feature selection mechanism in skip connections for r…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 15 pages, 11 figures

    ACM Class: I.4.6

  4. arXiv:2506.10712  [pdf, ps, other]

    cs.CV

    Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

    Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He

    Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first g…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 16 pages, 7 figures

  5. arXiv:2506.10380  [pdf, ps, other]

    cs.CL cs.IR

    TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

    Authors: Xiaohan Yu, Pu Jian, Chong Chen

    Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Under review. Codes are available at https://github.com/yxh-y/TableRAG/tree/main

  6. arXiv:2506.09454  [pdf, ps, other]

    cs.LG

    NDCG-Consistent Softmax Approximation with Accelerated Convergence

    Authors: Yuanhao Pu, Defu Lian, Xiaolong Chen, Xu Huang, Jin Chen, Enhong Chen

    Abstract: Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 35 pages

  7. arXiv:2506.06211  [pdf, other]

    cs.CL cs.AI cs.CV

    PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

    Authors: Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

    Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, o…

    Submitted 6 June, 2025; originally announced June 2025.

  8. arXiv:2506.02891  [pdf, ps, other]

    cs.CV

    OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis

    Authors: Jiewen Hu, Leena Mathur, Paul Pu Liang, Louis-Philippe Morency

    Abstract: In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection,…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: IEEE FG 2025, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work

  9. arXiv:2506.02308  [pdf, ps, other]

    cs.LG cs.AI

    MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

    Authors: Xiaojun Shan, Qi Cao, Xing Han, Haofei Yu, Paul Pu Liang

    Abstract: Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to e…

    Submitted 6 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

  10. arXiv:2506.02210  [pdf, ps, other]

    cs.LG cs.AI cs.PF

    Exchangeability in Neural Network Architectures and its Application to Dynamic Pruning

    Authors: Pu Yi, Tianlang Chen, Yifan Yang, Sara Achour

    Abstract: Neural networks (NNs) are equipped with increasingly many parameters and require more and more resources for deployment. Researchers have explored various ways to improve the efficiency of NNs by identifying and reducing the redundancy, such as pruning or quantizing unimportant weights. Symmetry in the NN architectures has been identified by prior work as a possible type of redundancy, but exploiti…

    Submitted 2 June, 2025; originally announced June 2025.

  11. arXiv:2506.00711  [pdf, other]

    cs.LG cs.AI cs.CV

    QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

    Authors: Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang

    Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med…

    Submitted 31 May, 2025; originally announced June 2025.

  12. arXiv:2506.00239  [pdf, ps, other]

    cs.AI

    SMELLNET: A Large-scale Dataset for Real-world Smell Recognition

    Authors: Dewei Feng, Carol Li, Wei Dai, Paul Pu Liang

    Abstract: The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large scale benchmarks, and therefore little pro…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: 22 pages, 13 figures

  13. arXiv:2506.00160  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement

    Authors: Qihui Fan, Enfu Nan, Wenbo Li, Lei Lu, Pu Zhao, Yanzhi Wang

    Abstract: The growing popularity of social deduction game systems for both business applications and AI research has greatly benefited from the rapid advancements in Large Language Models (LLMs), which now demonstrate stronger reasoning and persuasion capabilities. Especially with the rise of DeepSeek R1 and V3 models, LLMs should enable a more engaging experience for human players in LLM-agent-based socia…

    Submitted 30 May, 2025; originally announced June 2025.

  14. arXiv:2505.23868  [pdf, ps, other]

    cs.LG cs.AI

    Noise-Robustness Through Noise: Asymmetric LoRA Adaption with Poisoning Expert

    Authors: Zhaokun Wang, Jinyu Guo, Jingwen Pu, Lingfeng Chen, Hongli Pu, Jie Ou, Libo Qin, Wenhong Tian

    Abstract: Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-process paradigms, we propose a noise-robust adaptation…

    Submitted 8 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  15. arXiv:2505.23844  [pdf, ps, other]

    cs.CL

    Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

    Authors: Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang

    Abstract: Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs i…

    Submitted 28 May, 2025; originally announced May 2025.

  16. arXiv:2505.21835  [pdf, other]

    cs.LG cs.AI

    TuneComp: Joint Fine-tuning and Compression for Large Foundation Models

    Authors: Xiangyu Chen, Jing Liu, Ye Wang, Matthew Brand, Pu Wang, Toshiaki Koike-Akino

    Abstract: To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Preliminary Work

  17. arXiv:2505.21233  [pdf, ps, other]

    cs.CV

    CROP: Contextual Region-Oriented Visual Token Pruning

    Authors: Jiawei Guo, Feifei Zhai, Pu Jian, Qianrun Wei, Yu Zhou

    Abstract: Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel…

    Submitted 27 May, 2025; originally announced May 2025.

  18. arXiv:2505.20655  [pdf, ps, other]

    cs.CV

    Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

    Authors: Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang

    Abstract: Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositio…

    Submitted 26 May, 2025; originally announced May 2025.

  19. arXiv:2505.20613  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.LO

    REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

    Authors: Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong

    Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (…

    Submitted 16 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  20. arXiv:2505.20341  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset

    Authors: Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li

    Abstract: Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCor…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: INTERSPEECH2025. Code and audio examples: https://github.com/AI-S2-Lab/EmoCorrector

  21. arXiv:2505.20246  [pdf, ps, other]

    cs.AI cs.CL

    On Path to Multimodal Historical Reasoning: HistBench and HistAgent

    Authors: Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Shu Zhang, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, et al. (74 additional authors not shown)

    Abstract: Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks,…

    Submitted 19 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 17 pages, 7 figures

  22. arXiv:2505.20089  [pdf, ps, other]

    cs.SI cs.AI

    Homophily Enhanced Graph Domain Adaptation

    Authors: Ruiyi Fang, Bingheng Li, Jingyu Zhao, Ruizhi Pu, Qiuhao Zeng, Gezheng Xu, Charles Ling, Boyu Wang

    Abstract: Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs, addressing the challenge of label scarcity. In this paper, we highlight the significance of graph homophily, a pivotal factor for graph domain alignment, which, however, has long been overlooked in existing approaches. Specifically, our analysis first reveals that homophily discrepancies exist…

    Submitted 31 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted at ICML2025

  23. arXiv:2505.19821  [pdf, ps, other]

    cs.CR cs.LG

    Poison in the Well: Feature Embedding Disruption in Backdoor Attacks

    Authors: Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Shouling Ji

    Abstract: Backdoor attacks embed malicious triggers into training data, enabling attackers to manipulate neural network behavior during inference while maintaining high accuracy on benign inputs. However, existing backdoor attacks face limitations manifesting in excessive reliance on training data, poor stealth, and instability, which hinder their effectiveness in real-world applications. Therefore, this pa…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted to ICME 2025

    ACM Class: I.2.6; I.5.1; D.4.6

  24. arXiv:2505.19648  [pdf, other]

    cs.LO cs.AI

    Model Enumeration of Two-Variable Logic with Quadratic Delay Complexity

    Authors: Qiaolan Meng, Juhua Pu, Hongting Niu, Yuyi Wang, Yuanhong Wang, Ondřej Kuželka

    Abstract: We study the model enumeration problem of the function-free, finite domain fragment of first-order logic with two variables ($FO^2$). Specifically, given an $FO^2$ sentence $Γ$ and a positive integer $n$, how can one enumerate all the models of $Γ$ over a domain of size $n$? In this paper, we devise a novel algorithm to address this problem. The delay complexity, the time required between producin…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: 16 pages, 4 figures and to be published in Fortieth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS)

  25. arXiv:2505.19430  [pdf, ps, other]

    cs.CL cs.AI

    Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

    Authors: Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo

    Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities…

    Submitted 5 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

  26. arXiv:2505.18880  [pdf, ps, other]

    cs.CV cs.AI

    REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

    Authors: Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang, Hao-Wen Dong

    Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot 'quote' from the input videos, i.e., inserting short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a…

    Submitted 24 May, 2025; originally announced May 2025.

  27. arXiv:2505.18413  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    LatentLLM: Attention-Aware Joint Tensor Compression

    Authors: Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand

    Abstract: Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can si…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 37 pages, 16 figures

  28. arXiv:2505.18399  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Taming Diffusion for Dataset Distillation with High Representativeness

    Authors: Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin

    Abstract: Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset disti…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: The paper is accepted by ICML 2025

  29. arXiv:2505.18227  [pdf, ps, other]

    cs.LG cs.AI

    Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

    Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

    Abstract: In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has pr…

    Submitted 23 May, 2025; originally announced May 2025.

  30. arXiv:2505.17553  [pdf, ps, other]

    cs.LG cs.CL

    CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning

    Authors: Jinyuan Feng, Chaopeng Wei, Tenghai Qiu, Tianyi Hu, Zhiqiang Pu

    Abstract: In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar…

    Submitted 23 May, 2025; originally announced May 2025.

  31. arXiv:2505.17540  [pdf, ps, other]

    cs.CV cs.AI

    RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

    Authors: Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji

    Abstract: Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. I…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Code is available at: https://github.com/microsoft/DKI_LLM/tree/main/RePrompt

  32. arXiv:2505.17307  [pdf, other]

    cs.LG

    Wavelet Probabilistic Recurrent Convolutional Network for Multivariate Time Series Classification

    Authors: Pu Yang, J. A. Barria

    Abstract: This paper presents a Wavelet Probabilistic Recurrent Convolutional Network (WPRCN) for Multivariate Time Series Classification (MTSC), especially effective in handling non-stationary environments, data scarcity and noise perturbations. We introduce a versatile wavelet probabilistic module designed to extract and analyse the probabilistic features, which can seamlessly integrate with a variety of…

    Submitted 22 May, 2025; originally announced May 2025.

  33. arXiv:2505.16826  [pdf, ps, other]

    cs.AI cs.CL

    KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

    Authors: Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, Jiajun Zhang

    Abstract: Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO suffer from a coarse granularity issue when computing the advantage. Specifically, they comp…

    Submitted 22 May, 2025; originally announced May 2025.

  34. arXiv:2505.16149  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

    Authors: Zirui Pang, Haosheng Tan, Yuhan Pu, Zhijie Deng, Zhouan Shen, Keyu Hu, Jiaheng Wei

    Abstract: Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite the cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels due to the co-existing image pattern where multiple classes appear in an image sample. This results in misleading model comparisons and unfair evaluati…

    Submitted 21 May, 2025; originally announced May 2025.

  35. arXiv:2505.15047  [pdf, ps, other]

    cs.LG cs.AI

    PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration

    Authors: Yingming Pu, Tao Lin, Hongyu Chen

    Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering systematic uncertainty reduction.…

    Submitted 20 May, 2025; originally announced May 2025.

  36. arXiv:2505.14709  [pdf, ps, other]

    cs.CV cs.AI

    FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

    Authors: Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu

    Abstract: Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modul…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: Preprint Version

  37. arXiv:2505.14708  [pdf, ps, other]

    cs.CV cs.AI

    DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

    Authors: Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

    Abstract: Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To ad…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: Preprint Version

  38. arXiv:2505.14462  [pdf, ps, other]

    cs.CV cs.CL

    RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

    Authors: Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

    Abstract: As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its appli…

    Submitted 20 May, 2025; originally announced May 2025.

  39. arXiv:2505.14010  [pdf, ps, other]

    cs.CV

    UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache

    Authors: Pu Wang, Pengwen Dai, Chen Wu, Yeying Jin, Dianjie Lu, Guijuan Zhang, Youshan Zhang, Zhuoran Zheng

    Abstract: In this paper, we propose an efficient visual transformer framework for ultra-high-definition (UHD) image dehazing that addresses the key challenges of slow training speed and high memory consumption for existing methods. Our approach introduces two key innovations: 1) an adaptive normalization mechanism inspired by the nGPT architecture that enables ultra-fast and stable trainin…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Under review

  40. arXiv:2505.13820  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Structured Agent Distillation for Large Language Model

    Authors: Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

    Abstract: Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reason…

    Submitted 19 May, 2025; originally announced May 2025.

  41. arXiv:2505.12545  [pdf, other]

    cs.CL

    Towards Reliable and Interpretable Traffic Crash Pattern Prediction and Safety Interventions Using Customized Large Language Models

    Authors: Yang Zhao, Pu Wang, Yibo Zhao, Hongru Du, Hao Frank Yang

    Abstract: Predicting crash events is crucial for understanding crash distributions and their contributing factors, thereby enabling the design of proactive traffic safety policy interventions. However, existing methods struggle to interpret the complex interplay among various sources of traffic crash data, including numeric characteristics, textual reports, crash imagery, environmental conditions, and drive…

    Submitted 21 May, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

    Comments: Last revised 13 Feb 2025. Under review in Nature portfolio

  42. arXiv:2505.11815  [pdf, ps, other]

    cs.CV

    UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings

    Authors: Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z. Pan, Bei Yu

    Abstract: Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant chal…

    Submitted 16 May, 2025; originally announced May 2025.

  43. arXiv:2505.11321  [pdf, other]

    cs.LG eess.SP

    Anomaly Detection for Non-stationary Time Series using Recurrent Wavelet Probabilistic Neural Network

    Authors: Pu Yang, J. A. Barria

    Abstract: In this paper, an unsupervised Recurrent Wavelet Probabilistic Neural Network (RWPNN) is proposed, which aims at detecting anomalies in non-stationary environments by modelling the temporal features using a nonparametric density estimation network. The novel framework consists of two components, a Stacked Recurrent Encoder-Decoder (SREnc-Dec) module that captures temporal features in a latent spac…

    Submitted 16 May, 2025; originally announced May 2025.

  44. arXiv:2505.10732  [pdf]

    cs.CR cs.AI

    Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment

    Authors: Jia Hui Chin, Pu Zhang, Yu Xin Cheong, Jonathan Pan

    Abstract: In the current rapidly changing digital environment, businesses are under constant stress to ensure that their systems are secured. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, gaps are identified for cybersecurity risks mitigation. However, audits are usually manual, requiring much time and costs. This paper looks at…

    Submitted 15 May, 2025; originally announced May 2025.

  45. arXiv:2505.10604  [pdf, ps, other]

    cs.CV cs.AI

    MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence

    Authors: Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, Peiran Wu

    Abstract: Spatial perception and reasoning are core components of human cognition, encompassing object recognition, spatial relational understanding, and dynamic reasoning. Despite progress in computer vision, existing benchmarks reveal significant gaps in models' abilities to accurately recognize object attributes and reason about spatial relationships, both essential for dynamic reasoning. To address thes…

    Submitted 15 May, 2025; originally announced May 2025.

  46. arXiv:2505.10367  [pdf, ps, other]

    eess.SY cs.LG

    A Hybrid Strategy for Aggregated Probabilistic Forecasting and Energy Trading in HEFTCom2024

    Authors: Chuanqing Pu, Feilong Fan, Nengling Tai, Songyuan Liu, Jinming Yu

    Abstract: Obtaining accurate probabilistic energy forecasts and making effective decisions amid diverse uncertainties are routine challenges in future energy systems. This paper presents the solution of team GEB, which ranked 3rd in trading, 4th in forecasting, and 1st among student teams in the IEEE Hybrid Energy Forecasting and Trading Competition 2024 (HEFTCom2024). The solution provides accurate probabi…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Solution description of IEEE Hybrid Energy Forecasting and Trading Competition (HEFTCom)

  47. arXiv:2505.10322  [pdf, other]

    cs.LG math.OC

    Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework

    Authors: Yijie Zhou, Shi Pu

    Abstract: Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assum…

    Submitted 15 May, 2025; originally announced May 2025.

  48. arXiv:2505.09943  [pdf, ps, other]

    cs.CV

    CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

    Authors: Jiakun Deng, Kexuan Li, Xingye Cui, Jiaxuan Li, Chang Long, Tian Pu, Zhenming Peng

    Abstract: Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding net…

    Submitted 14 May, 2025; originally announced May 2025.

  49. arXiv:2505.09331  [pdf, other]

    cs.LG

    MUST: Multi-Scale Structural-Temporal Link Prediction Model for UAV Ad Hoc Networks

    Authors: Cunlai Pu, Fangrui Wu, Rajput Ramiz Sharafat, Guangzhao Dai, Xiangbo Shu

    Abstract: Link prediction in unmanned aerial vehicle (UAV) ad hoc networks (UANETs) aims to predict the potential formation of future links between UAVs. In adversarial environments where the route information of UAVs is unavailable, predicting future links must rely solely on the observed historical topological information of UANETs. However, the highly dynamic and sparse nature of UANET topologies present…

    Submitted 14 May, 2025; originally announced May 2025.

  50. arXiv:2505.07819  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

    Authors: Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu

    Abstract: Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H$^3$…

    Submitted 17 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.