Showing 1–50 of 1,603 results for author: Lin, Z

Searching in archive cs.
  1. arXiv:2507.01368  [pdf, ps, other]

    cs.CV cs.LG

    Activation Reward Models for Few-Shot Model Alignment

    Authors: Tianning Chai, Chancharik Mitra, Brandon Huang, Gautam Rajendrakumar Gare, Zhiqiu Lin, Assaf Arbelle, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Deva Ramanan, Roei Herzig

    Abstract: Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable…

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2507.00498  [pdf, ps, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    MuteSwap: Silent Face-based Voice Conversion

    Authors: Yifan Liu, Yu Fang, Zhouhan Lin

    Abstract: Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which does voice conversion entirely from visual inpu…

    Submitted 1 July, 2025; originally announced July 2025.

  3. arXiv:2507.00261  [pdf, ps, other]

    cs.CV cs.GR

    VirtualFencer: Generating Fencing Bouts based on Strategies Extracted from In-the-Wild Videos

    Authors: Zhiyin Lin, Purvi Goel, Joy Yun, C. Karen Liu, Joao Pedro Araujo

    Abstract: Fencing is a sport where athletes engage in diverse yet strategically logical motions. While most motions fall into a few high-level actions (e.g. step, lunge, parry), the execution can vary widely: fast vs. slow, large vs. small, offensive vs. defensive. Moreover, a fencer's actions are informed by a strategy that often comes in response to the opponent's behavior. This combination of motion diver…

    Submitted 30 June, 2025; originally announced July 2025.

  4. arXiv:2506.23474  [pdf, ps, other]

    cs.CR

    A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis

    Authors: Zhiwei Lin, Bonan Ruan, Jiahao Liu, Weibo Zhao

    Abstract: The Model Context Protocol (MCP) has recently emerged as a standardized interface for connecting language models with external tools and data. As the ecosystem rapidly expands, the lack of a structured, comprehensive view of existing MCP artifacts presents challenges for research. To bridge this gap, we introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP client…

    Submitted 29 June, 2025; originally announced June 2025.

  5. arXiv:2506.23361  [pdf, ps, other]

    cs.CV

    OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

    Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

    Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, is still less explored. In this paper, we first propose a data cons…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: A data construction pipeline and a diffusion Transformer framework for controllable subject-driven video customization

  6. arXiv:2506.23347  [pdf, ps, other]

    cs.CV

    CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

    Authors: Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu

    Abstract: The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradi…

    Submitted 29 June, 2025; originally announced June 2025.

  7. arXiv:2506.21188  [pdf, ps, other]

    cs.CV

    GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

    Authors: Zijun Lin, Shuting He, Cheston Tan, Bihan Wen

    Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here"…

    Submitted 26 June, 2025; originally announced June 2025.

  8. arXiv:2506.21049  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    A Semi-supervised Scalable Unified Framework for E-commerce Query Classification

    Authors: Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law

    Abstract: Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users' posterior click behavior to construct tr…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025

  9. arXiv:2506.20170  [pdf, ps, other]

    cs.CR

    JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation

    Authors: Guoqiang Chen, Xin Jin, Zhiqiang Lin

    Abstract: Deobfuscating JavaScript (JS) code poses a significant challenge in web security, particularly as obfuscation techniques are frequently used to conceal malicious activities within scripts. While Large Language Models (LLMs) have recently shown promise in automating the deobfuscation process, transforming detection and mitigation strategies against these obfuscated threats, a systematic benchmark t…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted by ACM CCS 2025

  10. arXiv:2506.19563  [pdf, ps, other]

    cs.CR cs.AI

    PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty

    Authors: Jinwen He, Yiyang Lu, Zijin Lin, Kai Chen, Yue Zhao

    Abstract: Large Language Models (LLMs) are widely used in sensitive domains, including healthcare, finance, and legal services, raising concerns about potential private information leaks during inference. Privacy extraction attacks, such as jailbreaking, expose vulnerabilities in LLMs by crafting inputs that force the models to output sensitive information. However, these attacks cannot verify whether the e…

    Submitted 24 June, 2025; originally announced June 2025.

  11. arXiv:2506.18932  [pdf, ps, other]

    cs.CY cs.AI cs.CR

    AI Safety vs. AI Security: Demystifying the Distinction and Boundaries

    Authors: Zhiqiang Lin, Huan Sun, Ness Shroff

    Abstract: Artificial Intelligence (AI) is rapidly being integrated into critical systems across various domains, from healthcare to autonomous vehicles. While its integration brings immense benefits, it also introduces significant risks, including those arising from AI misuse. Within the discourse on managing these risks, the terms "AI Safety" and "AI Security" are often used, sometimes interchangeably, res…

    Submitted 21 June, 2025; originally announced June 2025.

  12. arXiv:2506.18899  [pdf, ps, other]

    cs.CV

    FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation

    Authors: Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, Xihui Liu

    Abstract: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system t…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project Page: https://filmaster-ai.github.io/

  13. arXiv:2506.18781  [pdf, ps, other]

    cs.CL

    Existing LLMs Are Not Self-Consistent For Simple Tasks

    Authors: Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art mode…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 10 pages, 6 figures

  14. arXiv:2506.18565  [pdf, ps, other]

    cs.CE

    A Physics-Informed Neural Network Framework for Simulating Creep Buckling in Growing Viscoelastic Biological Tissues

    Authors: Zhongya Lin, Jinshuai Bai, Shuang Li, Xindong Chen, Bo Li, Xi-Qiao Feng

    Abstract: Modeling viscoelastic behavior is crucial in engineering and biomechanics, where materials undergo time-dependent deformations, including stress relaxation, creep buckling and biological tissue development. Traditional numerical methods, like the finite element method, often require explicit meshing, artificial perturbations or embedding customised programs to capture these phenomena, adding compu…

    Submitted 23 June, 2025; originally announced June 2025.

  15. arXiv:2506.18234  [pdf, ps, other]

    cs.CV cs.RO

    Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning

    Authors: Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, Xinhai Zhao

    Abstract: Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-of…

    Submitted 22 June, 2025; originally announced June 2025.

  16. arXiv:2506.17869  [pdf, ps, other]

    cs.CV cs.RO

    Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation

    Authors: Xiaodong Guo, Zi'ang Lin, Luwen Hu, Zhihong Deng, Tong Liu, Wujie Zhou

    Abstract: The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-therma…

    Submitted 21 June, 2025; originally announced June 2025.

  17. arXiv:2506.17612  [pdf, ps, other]

    cs.CV

    JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

    Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan

    Abstract: Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, faili…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 40 pages, 26 figures

  18. arXiv:2506.16702  [pdf]

    cs.CY cs.AI cs.CL cs.HC

    Large Language Models as Psychological Simulators: A Methodological Guide

    Authors: Zhicheng Lin

    Abstract: Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. This article provides a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, we…

    Submitted 19 June, 2025; originally announced June 2025.

  19. arXiv:2506.16697  [pdf]

    cs.CY cs.AI cs.CL cs.HC

    From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology

    Authors: Zhicheng Lin

    Abstract: Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychologic…

    Submitted 19 June, 2025; originally announced June 2025.

  20. arXiv:2506.16037  [pdf, ps, other]

    cs.CL cs.LG

    Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

    Authors: Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu

    Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent respons…

    Submitted 19 June, 2025; originally announced June 2025.

  21. arXiv:2506.15050  [pdf, ps, other]

    cs.AI

    Truncated Proximal Policy Optimization

    Authors: Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu

    Abstract: Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error…

    Submitted 17 June, 2025; originally announced June 2025.

  22. arXiv:2506.14973  [pdf, ps, other]

    eess.AS cs.AI

    Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

    Authors: Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze

    Abstract: Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone a…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  23. SimSpark: Interactive Simulation of Social Media Behaviors

    Authors: Ziyue Lin, Yi Shan, Lin Gao, Xinghua Jia, Siming Chen

    Abstract: Understanding user behaviors on social media has garnered significant scholarly attention, enhancing our comprehension of how virtual platforms impact society and empowering decision-makers. Simulating social media behaviors provides a robust tool for capturing the patterns of social media behaviors, testing hypotheses, and predicting the effects of various interventions, ultimately contributing t…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 32 pages, 7 figures

    Journal ref: Proc. ACM Hum.-Comput. Interact. 9, 2, Article CSCW168 (April 2025), 32 pages

  24. arXiv:2506.13229  [pdf, ps, other]

    cs.CL

    IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation

    Authors: Zijie Lin, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Fuli Feng, Tat-Seng Chua

    Abstract: Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness: many tokens contribute little to item discrimination…

    Submitted 16 June, 2025; originally announced June 2025.

  25. arXiv:2506.13061  [pdf, ps, other]

    cs.LG math.CA math.NA

    Fast Convergence for High-Order ODE Solvers in Diffusion Probabilistic Models

    Authors: Daniel Zhengyu Huang, Jiaoyang Huang, Zhengjiang Lin

    Abstract: Diffusion probabilistic models generate samples by learning to reverse a noise-injection process that transforms data into noise. Reformulating this reverse process as a deterministic probability flow ordinary differential equation (ODE) enables efficient sampling using high-order solvers, often requiring only $\mathcal{O}(10)$ steps. Since the score function is typically approximated by a neural…

    Submitted 18 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

    Comments: 64 pages, 7 figures

  26. arXiv:2506.12735  [pdf, ps, other]

    cs.LG cs.AI

    Revealing the Challenges of Sim-to-Real Transfer in Model-Based Reinforcement Learning via Latent Space Modeling

    Authors: Zhilin Lin, Shiliang Sun

    Abstract: Reinforcement learning (RL) is playing an increasingly important role in fields such as robotic control and autonomous driving. However, the gap between simulation and the real environment remains a major obstacle to the practical deployment of RL. Agents trained in simulators often struggle to maintain performance when transferred to real-world physical environments. In this paper, we propose a l…

    Submitted 15 June, 2025; originally announced June 2025.

  27. arXiv:2506.10685  [pdf, ps, other]

    cs.CV cs.CR

    Defensive Adversarial CAPTCHA: A Semantics-Driven Framework for Natural Adversarial Example Generation

    Authors: Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun, Cong Wu, Tao Li, Zhe Chen, Wei Ni, Jun Luo

    Abstract: Traditional CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on the original image characteristics, resulting in distortions that hinder human interpretation and limit their applicability in scenarios where no initial in…

    Submitted 1 July, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 13 pages, 6 figures

  28. arXiv:2506.10395  [pdf, ps, other]

    cs.CV cs.AI

    Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

    Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie

    Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image underst…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Unified image understanding and generation model

  29. arXiv:2506.10352  [pdf, ps, other]

    cs.LG

    History-Aware Neural Operator: Robust Data-Driven Constitutive Modeling of Path-Dependent Materials

    Authors: Binyao Guo, Zihan Lin, QiZhi He

    Abstract: This study presents an end-to-end learning framework for data-driven modeling of path-dependent inelastic materials using neural operators. The framework is built on the premise that irreversible evolution of material responses, governed by hidden dynamics, can be inferred from observable data. We develop the History-Aware Neural Operator (HANO), an autoregressive model that predicts path-depend…

    Submitted 12 June, 2025; originally announced June 2025.

  30. arXiv:2506.10323  [pdf, ps, other]

    cs.CR cs.SE

    ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

    Authors: Chuyang Chen, Brendan Dolan-Gavitt, Zhiqiang Lin

    Abstract: Generation-based fuzzing produces appropriate testing cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual efforts to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailor…

    Submitted 26 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted by USENIX Security'25 Cycle 2

    Journal ref: The 34th USENIX Security Symposium, 2025

  31. arXiv:2506.10022  [pdf, ps, other]

    cs.CR cs.AI

    LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

    Authors: Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li

    Abstract: The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to the ACL 2025 main conference

  32. arXiv:2506.09638  [pdf, ps, other]

    cs.LG cs.CV

    FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

    Authors: Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang, Liangqiong Qu

    Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict p…

    Submitted 11 June, 2025; originally announced June 2025.

  33. arXiv:2506.09351  [pdf, ps, other]

    cs.CL

    DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

    Authors: Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang

    Abstract: Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconst…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: ACL 2025

  34. arXiv:2506.09217  [pdf, ps, other]

    cs.RO cs.CV stat.AP

    Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule

    Authors: Boyu Jiang, Liang Shi, Zhengzhi Lin, Loren Stowe, Feng Guo

    Abstract: The performance of perception systems in autonomous driving systems (ADS) is strongly influenced by object distance, scene dynamics, and environmental conditions such as weather. AI-based perception outputs are inherently stochastic, with variability driven by these external factors, while traditional evaluation metrics remain static and event-independent, failing to capture fluctuations in confid…

    Submitted 10 June, 2025; originally announced June 2025.

  35. arXiv:2506.09113  [pdf, ps, other]

    cs.CV

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Authors: Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, et al. (19 additional authors not shown)

    Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core tec…

    Submitted 28 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Seedance 1.0 Technical Report

  36. arXiv:2506.08646  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

    Authors: Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang

    Abstract: Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data qua…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 27 pages, 19 figures, Findings of ACL 2025

  37. arXiv:2506.08426  [pdf, ps, other]

    cs.LG cs.AI cs.DC

    HASFL: Heterogeneity-aware Split Federated Learning over Edge Computing Systems

    Authors: Zheng Lin, Zhe Chen, Xianhao Chen, Wei Ni, Yue Gao

    Abstract: Split federated learning (SFL) has emerged as a promising paradigm to democratize machine learning (ML) on edge devices by enabling layer-wise model partitioning. However, existing SFL approaches suffer significantly from the straggler effect due to the heterogeneous capabilities of edge devices. To address the fundamental challenge, we propose adaptively controlling batch sizes (BSs) and model sp…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 16 pages, 11 figures. arXiv admin note: text overlap with arXiv:2403.13101

  38. arXiv:2506.07818  [pdf, ps, other]

    cs.CL

    WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

    Authors: Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, Xuelong Li

    Abstract: With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is cr…

    Submitted 9 June, 2025; originally announced June 2025.

  39. arXiv:2506.07555  [pdf, ps, other]

    cs.CV cs.AI

    Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries

    Authors: Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang

    Abstract: Generating high fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis vi…

    Submitted 13 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  40. arXiv:2506.07520  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    LeVo: High-Quality Song Generation with Multi-Preference Alignment

    Authors: Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

    Abstract: Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address…

    Submitted 15 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  41. arXiv:2506.07309  [pdf, other]

    cs.CL

    ConfQA: Answer Only If You Are Confident

    Authors: Yin Huang, Yifan Ethan Xu, Kai Sun, Vera Yan, Alicia Sun, Haidar Khan, Jimmy Nguyen, Mohammad Kachuee, Zhaojiang Lin, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

    Abstract: Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to a…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 10 pages main content, 10 pages appendix, 5 figures, 7 tables

  42. arXiv:2506.07177  [pdf, ps, other]

    cs.CV cs.AI

    Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

    Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

    Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation b…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Project page: https://frame-guidance-video.github.io/

  43. arXiv:2506.06844  [pdf, ps, other]

    cs.CL

    Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models

    Authors: Naibin Gu, Peng Fu, Xiyu Liu, Ke Ma, Zheng Lin, Weiping Wang

    Abstract: Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these nu…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025

  44. arXiv:2506.05904  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.HC

    Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

    Authors: Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, Seungwhan Moon

    Abstract: Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present…

    Submitted 6 June, 2025; originally announced June 2025.

  45. arXiv:2506.05301  [pdf, other]

    cs.CV

    SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

    Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang

    Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Draft Ver. Project page: https://iceclear.github.io/projects/seedvr2/

  46. arXiv:2506.05175  [pdf, ps, other]

    cs.CV

    Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

    Authors: Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, Yixuan Yuan

    Abstract: Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their abili…

    Submitted 5 June, 2025; originally announced June 2025.

  47. arXiv:2506.04042  [pdf, ps, other]

    cs.CL

    Unveiling and Eliminating the Shortcut Learning for Locate-Then-Edit Knowledge Editing via Both Subject and Relation Awareness

    Authors: Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Ji Xiang, Weiping Wang

    Abstract: Knowledge editing aims to alter the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable…

    Submitted 4 June, 2025; originally announced June 2025.

  48. arXiv:2506.03827  [pdf, other]

    cs.CL cs.AI cs.IR

    Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising

    Authors: Zhenhui Liu, Chunyuan Yuan, Ming Pang, Zheng Fang, Li Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo, Jingping Shao

    Abstract: Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements, playing a crucial role in e-commerce search advertising. The diversity of user needs and expressions often produces massive long-tail queries that cannot be matched with merchant bidwords or product titles, which results in some advertisements not being recalled, ultimately harming use…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted by SIGIR2025

  49. arXiv:2506.03569  [pdf, ps, other]

    cs.CL

    MiMo-VL Technical Report

    Authors: Xiaomi LLM-Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, et al. (50 additional authors not shown)

    Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 32 pages

  50. arXiv:2506.03483  [pdf, ps, other]

    cs.CL

    APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training

    Authors: Jun Rao, Zepeng Lin, Xuebo Liu, Xiaopeng Ke, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang

    Abstract: Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: ACL2025 Findings