Showing 1–50 of 849 results for author: Lin, W

  1. arXiv:2507.00490  [pdf, ps, other]

    cs.CV eess.IV

    Just Noticeable Difference for Large Multimodal Models

    Authors: Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang

    Abstract: Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large mu…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: 19 pages, 19 figures

  2. arXiv:2507.00373  [pdf, ps, other]

    cs.CV eess.IV

    Customizable ROI-Based Deep Image Compression

    Authors: Jian Jin, Fanxin Xia, Feng Ding, Xinfeng Zhang, Meiqin Liu, Yao Zhao, Weisi Lin, Lili Meng

    Abstract: Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-…

    Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced July 2025.

  3. arXiv:2506.19767  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

    Authors: Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

    Abstract: Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT ind…

    Submitted 24 June, 2025; originally announced June 2025.

  4. arXiv:2506.16495  [pdf, ps, other]

    cs.MM cs.CV

    DT-UFC: Universal Large Model Feature Coding via Peaky-to-Balanced Distribution Transformation

    Authors: Changsheng Gao, Zijie Liu, Li Li, Dong Liu, Xiaoyan Sun, Weisi Lin

    Abstract: Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage overhead. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unaddressed. In this paper, we present the first sys…

    Submitted 19 June, 2025; originally announced June 2025.

  5. arXiv:2506.15617  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    The Compositional Architecture of Regret in Large Language Models

    Authors: Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang

    Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then an…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 23 pages

  6. arXiv:2506.15228  [pdf, ps, other]

    eess.IV cs.MM

    ABC: Adaptive BayesNet Structure Learning for Computational Scalable Multi-task Image Compression

    Authors: Yufeng Zhang, Wenrui Dai, Hang Yu, Shizhan Liu, Junhui Hou, Jianguo Li, Weiyao Lin

    Abstract: Neural Image Compression (NIC) has revolutionized image compression with its superior rate-distortion performance and multi-task capabilities, supporting both human visual perception and machine vision tasks. However, its widespread adoption is hindered by substantial computational demands. While existing approaches attempt to address this challenge through module-specific optimizations or pre-def…

    Submitted 18 June, 2025; originally announced June 2025.

  7. arXiv:2506.14363  [pdf, ps, other]

    cs.LO

    OSTRICH2: Solver for Complex String Constraints

    Authors: Matthew Hague, Denghang Hu, Artur Jeż, Anthony W. Lin, Oliver Markgraf, Philipp Rümmer, Zhilin Wu

    Abstract: We present OSTRICH2, the latest evolution of the SMT solver OSTRICH for string constraints. OSTRICH2 supports a wide range of complex functions on strings and provides completeness guarantees for a substantial fragment of string constraints, including the straight-line fragment and the chain-free fragment. OSTRICH2 provides full support for the SMT-LIB theory of Unicode strings, extending the stan…

    Submitted 17 June, 2025; originally announced June 2025.

  8. Towards Robust Learning to Optimize with Theoretical Guarantees

    Authors: Qingyu Song, Wei Lin, Juncheng Wang, Hong Xu

    Abstract: Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Despite great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this g…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Published in CVPR 2024, 55 pages, 17 figures, this version fixes some typos

    Journal ref: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, 17297-17306

  9. arXiv:2506.12737  [pdf, ps, other]

    cs.CV cs.DC

    Cross-architecture universal feature coding via distribution alignment

    Authors: Changsheng Gao, Shan Liu, Feng Wu, Weisi Lin

    Abstract: Feature coding has become increasingly important in scenarios where semantic representations rather than raw pixels are transmitted and stored. However, most existing methods are architecture-specific, targeting either CNNs or Transformers. This design limits their applicability in real-world scenarios where features from both architectures coexist. To address this gap, we introduce a new research…

    Submitted 15 June, 2025; originally announced June 2025.

  10. arXiv:2506.12103  [pdf, other]

    cs.AI cs.CY cs.LG

    The Amazon Nova Family of Models: Technical Report and Model Card

    Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, et al. (761 additional authors not shown)

    Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…

    Submitted 17 March, 2025; originally announced June 2025.

    Comments: 48 pages, 10 figures

    Report number: 20250317

  11. arXiv:2506.11997  [pdf, ps, other]

    cs.LG stat.ML

    pLSTM: parallelizable Linear Source Transition Mark networks

    Authors: Korbinian Pöppel, Richard Freinschlag, Thomas Schmied, Wei Lin, Sepp Hochreiter

    Abstract: Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a…

    Submitted 13 June, 2025; originally announced June 2025.

  12. arXiv:2506.10516  [pdf, ps, other]

    cs.CV cs.AI

    CogStream: Context-guided Streaming Video Question Answering

    Authors: Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu

    Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant co…

    Submitted 12 June, 2025; originally announced June 2025.

  13. arXiv:2506.10235  [pdf, ps, other]

    cs.LG cs.AI cs.AR

    LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation

    Authors: Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang

    Abstract: Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to O(|V|^2) token length and suffers from lo…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted at 42nd International Conference on Machine Learning (ICML) 2025

  14. arXiv:2506.08528  [pdf, ps, other]

    cs.DC cs.LG cs.OS

    PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production

    Authors: Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai

    Abstract: Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world trainin…

    Submitted 11 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

  15. arXiv:2506.07811  [pdf, ps, other]

    cs.CV

    Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

    Authors: Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin

    Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant perfo…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Preprint

  16. arXiv:2506.07412  [pdf, ps, other]

    cs.CV

    Compressed Feature Quality Assessment: Dataset and Baselines

    Authors: Changsheng Gao, Wei Zhou, Guosheng Lin, Weisi Lin

    Abstract: The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process introduces inhe…

    Submitted 9 June, 2025; originally announced June 2025.

  17. arXiv:2506.06218  [pdf, ps, other]

    cs.CV

    STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

    Authors: Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger

    Abstract: We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Dataset: https://huggingface.co/datasets/ivc-lrp/STSBench, Code: https://github.com/LRP-IVC/STSBench

  18. arXiv:2506.05782  [pdf, ps, other]

    cs.CV

    GazeNLQ @ Ego4D Natural Language Queries Challenge 2025

    Authors: Wei-Cheng Lin, Chih-Ming Lien, Chen Lo, Chia-Hung Yeh

    Abstract: This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve…

    Submitted 6 June, 2025; originally announced June 2025.

  19. arXiv:2506.05682  [pdf, ps, other]

    cs.AR

    Lumina: Real-Time Mobile Neural Rendering by Exploiting Computational Redundancy

    Authors: Yu Feng, Weikai Lin, Yuge Cheng, Zihan Liu, Jingwen Leng, Minyi Guo, Chen Chen, Shixuan Sun, Yuhao Zhu

    Abstract: 3D Gaussian Splatting (3DGS) has vastly advanced the pace of neural rendering, but it remains computationally demanding on today's mobile SoCs. To address this challenge, we propose Lumina, a hardware-algorithm co-designed system, which integrates two principal optimizations: a novel algorithm, S^2, and a radiance caching mechanism, RC, to improve the efficiency of neural rendering. S^2 algorithm e…

    Submitted 5 June, 2025; originally announced June 2025.

  20. arXiv:2506.05343  [pdf, ps, other]

    cs.CV

    ContentV: Efficient Training of Video Generation Models with Limited Compute

    Authors: Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo

    Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across m…

    Submitted 11 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: Project Page: https://contentv.github.io

  21. arXiv:2506.05331  [pdf, ps, other]

    cs.CV

    MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

    Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li

    Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained b…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Code is released at https://github.com/xinyan-cxy/MINT-CoT

  22. arXiv:2506.05302  [pdf, ps, other]

    cs.CV

    Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

    Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li

    Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categor…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 19 pages, 13 figures, Website: https://Perceive-Anything.github.io

  23. arXiv:2506.04734  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

    Authors: Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang

    Abstract: Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to subs…

    Submitted 10 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  24. arXiv:2506.01194  [pdf, ps, other]

    cs.LG cs.DC

    FedRPCA: Enhancing Federated LoRA Aggregation Using Robust PCA

    Authors: Divyansh Jhunjhunwala, Arian Raje, Madan Ravi Ganesh, Chaithanya Kumar Mummadi, Chaoqun Dong, Jiawei Zhou, Wan-Yi Lin, Gauri Joshi, Zhenzhen Li

    Abstract: LoRA has emerged as one of the most promising fine-tuning techniques, especially for federated learning (FL), since it significantly reduces communication and computation costs at resource-constrained clients. However, data heterogeneity remains a significant challenge for LoRA-based FL, and the conventional aggregation strategy based on FedAvg suffers from slow convergence and suboptimal accuracy…

    Submitted 1 June, 2025; originally announced June 2025.

  25. arXiv:2506.00959  [pdf, other]

    cs.LG

    Hidden Representation Clustering with Multi-Task Representation Learning towards Robust Online Budget Allocation

    Authors: Xiaohan Wang, Yu Zhang, Guibin Jiang, Bing Cheng, Wei Lin

    Abstract: Marketing optimization, commonly formulated as an online budget allocation problem, has emerged as a pivotal factor in driving user growth. Most existing research addresses this problem by following the principle of 'first predict then optimize' for each individual, which presents challenges related to large-scale counterfactual prediction and solving complexity trade-offs. Note that the practical…

    Submitted 1 June, 2025; originally announced June 2025.

  26. arXiv:2506.00759  [pdf, ps, other]

    cs.CL

    Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons

    Authors: Wenshuo Dong, Qingsong Yang, Shu Yang, Lijie Hu, Meng Ding, Wanyu Lin, Tianhang Zheng, Di Wang

    Abstract: Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data…

    Submitted 8 June, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

  27. arXiv:2506.00439  [pdf, ps, other]

    cs.LG cs.AI

    RLAE: Reinforcement Learning-Assisted Ensemble for LLMs

    Authors: Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, Dongbin Zhao

    Abstract: Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning-Assisted Ense…

    Submitted 31 May, 2025; originally announced June 2025.

  28. arXiv:2505.24406  [pdf, ps, other]

    cs.CV

    IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models

    Authors: Hanting Wang, Tao Jin, Wang Lin, Shulei Wang, Hai Huang, Shengpeng Ji, Zhou Zhao

    Abstract: Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminat…

    Submitted 30 May, 2025; originally announced May 2025.

  29. arXiv:2505.24137  [pdf]

    cs.AR

    Energy-Oriented Computing Architecture Simulator for SNN Training

    Authors: Yunhao Ma, Wanyi Jia, Yanyu Lin, Wenjie Lin, Xueke Zhu, Huihui Zhou, Fengwei An

    Abstract: With the growing demand for intelligent computing, neuromorphic computing, a paradigm that mimics the structure and functionality of the human brain, offers a promising approach to developing new high-efficiency intelligent computing systems. Spiking Neural Networks (SNNs), the foundation of neuromorphic computing, have garnered significant attention due to their unique potential in energy efficie…

    Submitted 5 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  30. arXiv:2505.23799  [pdf, ps, other]

    cs.CL cs.AI cs.HC cs.LG

    Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

    Authors: Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer

    Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM r…

    Submitted 2 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  31. arXiv:2505.23399  [pdf]

    cs.AI

    GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

    Authors: Jusheng Zhang, Yijia Fan, Wenjun Lin, Ruiqi Chen, Haoyi Jiang, Wenhao Chai, Jian Wang, Keze Wang

    Abstract: We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents--each specializing in visual perception subtasks--and a critical agent that verifies logic consistency and factual correctness. Agents communicate via structured cl…

    Submitted 29 May, 2025; originally announced May 2025.

  32. arXiv:2505.22715  [pdf, ps, other]

    quant-ph cs.ET

    Routing-Aware Placement for Zoned Neutral Atom-based Quantum Computing

    Authors: Yannick Stade, Wan-Hsuan Lin, Jason Cong, Robert Wille

    Abstract: Quantum computing promises to solve previously intractable problems, with neutral atoms emerging as a promising technology. Zoned neutral atom architectures allow for immense parallelism and higher coherence times by shielding idling atoms from interference with laser beams. However, in addition to hardware, successful quantum computation requires sophisticated software support, particularly compi…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 9 pages, 10 figures

  33. arXiv:2505.22048  [pdf, ps, other]

    stat.ML cs.LG

    Learning Curves of Stochastic Gradient Descent in Kernel Regression

    Authors: Haihan Zhang, Weicheng Lin, Yuanshi Liu, Cong Fang

    Abstract: This paper considers a canonical problem in kernel regression: how well does the model perform when it is trained by the popular online first-order algorithms, compared to the offline ones, such as ridge and ridgeless regression? In this paper, we analyze the foundational single-pass Stochastic Gradient Descent (SGD) in kernel regression under source condition where the optimal predictor can e…

    Submitted 28 May, 2025; originally announced May 2025.

  34. arXiv:2505.21943  [pdf, ps, other]

    cs.CV

    Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting

    Authors: Wei Lin, Chenyang Zhao, Antoni B. Chan

    Abstract: Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this pa…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted by CVPR 2025 (highlight)

  35. arXiv:2505.21496  [pdf, ps, other]

    cs.CL cs.CV cs.LG

    UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

    Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li

    Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently p…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: https://github.com/Euphoria16/UI-Genie

  36. arXiv:2505.19997  [pdf, ps, other]

    cs.LG cs.CL cs.CY

    Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

    Authors: Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu

    Abstract: Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as "helpful assistants", aim to generate perfect responses. As a result, they struggle to simula…

    Submitted 26 May, 2025; originally announced May 2025.

  37. arXiv:2505.19151  [pdf, ps, other]

    cs.GR cs.AI cs.CV

    SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

    Authors: Shenggan Cheng, Yuanxin Wei, Lansong Diao, Yong Liu, Bujiao Chen, Lianghua Huang, Yu Liu, Wenyuan Yu, Jiangsu Du, Wei Lin, Yang You

    Abstract: Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usuall…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 9 pages, 6 figures

  38. arXiv:2505.18654  [pdf, ps, other]

    cs.IR

    MTGR: Industrial-Scale Generative Recommendation Framework in Meituan

    Authors: Ruidong Han, Bin Yin, Shangyu Chen, He Jiang, Fei Jiang, Xiang Li, Chi Ma, Mincong Huang, Xiaoguang Li, Chunzhen Jing, Yueming Han, Menglei Zhou, Lei Yu, Chuan Liu, Wei Lin

    Abstract: Scaling law has been extensively validated in many domains such as natural language processing and computer vision. In the recommendation system, recent work has adopted generative recommendations to achieve scalability, but their generative approaches require abandoning the carefully constructed cross features of traditional recommendation models. We found that this approach significantly degrade…

    Submitted 20 June, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

  39. arXiv:2505.18584  [pdf, ps, other]

    cs.CV

    Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

    Authors: Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

    Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to un…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Under Review

  40. arXiv:2505.18502  [pdf, other]

    cs.AI cs.CL cs.LG

    Knowledge Grafting of Large Language Models

    Authors: Guodong Du, Xuanning Zhou, Junlin Li, Zhuo Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li

    Abstract: Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more effici…

    Submitted 24 May, 2025; originally announced May 2025.

  41. arXiv:2505.18115  [pdf, ps, other]

    cs.CV

    Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

    Authors: Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky

    Abstract: Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They…

    Submitted 23 May, 2025; originally announced May 2025.

  42. arXiv:2505.17524  [pdf, other]

    cs.CE

    Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

    Authors: Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang

    Abstract: De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attri…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  43. arXiv:2505.17155  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

    Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan

    Abstract: Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key inefficiency source is LRMs often generat…

    Submitted 31 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  44. arXiv:2505.16815  [pdf, other]

    cs.CV cs.RO

    Perceptual Quality Assessment for Embodied AI

    Authors: Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai

    Abstract: Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual qual…

    Submitted 22 May, 2025; originally announced May 2025.

  45. arXiv:2505.16429  [pdf, other]

    cs.CL cs.AI

    Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

    Authors: Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan

    Abstract: Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation p…

    Submitted 22 May, 2025; originally announced May 2025.

  46. arXiv:2505.15959  [pdf, ps, other]

    cs.LO cs.FL

    HornStr: Invariant Synthesis for Regular Model Checking as Constrained Horn Clauses (Technical Report)

    Authors: Hongjian Jiang, Anthony W. Lin, Oliver Markgraf, Philipp Rümmer, Daniel Stan

    Abstract: We present HornStr, the first solver for invariant synthesis for Regular Model Checking (RMC) with the specification provided in the SMT-LIB 2.6 theory of strings. It is well-known that invariant synthesis for RMC subsumes various important verification problems, including safety verification for parameterized systems. To achieve a simple and standardized file format, we treat the invariant synthe…

    Submitted 23 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  47. arXiv:2505.15315  [pdf, other]

    cs.CE

    Local-Global Associative Frames for Symmetry-Preserving Crystal Structure Modeling

    Authors: Haowei Hua, Wanyu Lin

    Abstract: Crystal structures are defined by the periodic arrangement of atoms in 3D space, inherently making them equivariant to the SO(3) group. A fundamental requirement for crystal property prediction is that the model's output should remain invariant to arbitrary rotational transformations of the input structure. One promising strategy to achieve this invariance is to align the given crystal structure into…

    Submitted 21 May, 2025; originally announced May 2025.

  48. arXiv:2505.14899  [pdf, ps, other]

    cs.RO cs.CL

    Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs

    Authors: Wenjie Lin, Jin Wei-Kocsis

    Abstract: While large language models (LLMs) have shown great potential across various domains, their applications in robotics remain largely limited to static, prompt-based behaviors and still face challenges in handling complex tasks under zero-shot or few-shot settings. Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental research qu…

    Submitted 20 May, 2025; originally announced May 2025.

  49. arXiv:2505.13489  [pdf, ps, other]

    cs.AI cs.CL

    Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer

    Authors: Wenkang Han, Wang Lin, Liya Hu, Zhenlong Dai, Yiyun Zhou, Mengze Li, Zemin Liu, Chang Yao, Jingyuan Chen

    Abstract: Knowledge tracing (KT) aims to predict learners' future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners' knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI 2025

  50. arXiv:2505.13389  [pdf, ps, other]

    cs.CV

    VSA: Faster Video Diffusion with Trainable Sparse Attention

    Authors: Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

    Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies hi…

    Submitted 26 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.