Showing 1–50 of 648 results for author: Yan, Z

Searching in archive cs.
  1. arXiv:2505.01743  [pdf, other]

    cs.CV cs.AI cs.LG

    An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

    Authors: Siyang Jiang, Bufang Yang, Lilin Xu, Mu Yuan, Yeerzhati Abudunuer, Kaiwei Liu, Liekang Zeng, Hongkai Chen, Zhenyu Yan, Xiaofan Jiang, Guoliang Xing

    Abstract: The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well a…

    Submitted 3 May, 2025; originally announced May 2025.

  2. arXiv:2504.18068  [pdf, other]

    cs.CV cs.AI

    S3MOT: Monocular 3D Object Tracking with Selective State Space Model

    Authors: Zhuohao Yan, Shaoquan Feng, Xingxing Li, Yuxuan Zhou, Chunxi Xia, Shengyu Li

    Abstract: Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for…

    Submitted 25 April, 2025; originally announced April 2025.

  3. arXiv:2504.15699  [pdf, other]

    cs.AI

    Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

    Authors: Ning Wang, Zihan Yan, Weiyang Li, Chuan Ma, He Chen, Tao Xiang

    Abstract: Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agen…

    Submitted 8 May, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

    Comments: 9 pages

  4. arXiv:2504.15587  [pdf, other]

    cs.LG cs.AI

    MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design

    Authors: Zimo Yan, Jie Zhang, Zheng Xie, Chang Liu, Yizhen Liu, Yiping Song

    Abstract: Molecular generation plays an important role in drug discovery and materials science, especially in data-scarce scenarios where traditional generative models often struggle to achieve satisfactory conditional generalization. To address this challenge, we propose MetaMolGen, a first-order meta-learning-based molecular generator designed for few-shot and property-conditioned molecular generation. Me…

    Submitted 12 May, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  5. arXiv:2504.14960  [pdf, other]

    cs.LG cs.DC

    MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

    Authors: Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang

    Abstract: Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end train…

    Submitted 23 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

  6. arXiv:2504.14906  [pdf, other]

    eess.AS cs.CV cs.SD

    OmniAudio: Generating Spatial Audio from 360-Degree Video

    Authors: Huadai Liu, Tianyi Luo, Qikai Jiang, Kaicheng Luo, Peiwen Sun, Jialei Wan, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

    Abstract: Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a stan…

    Submitted 11 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: ICML 2025

  7. arXiv:2504.13482  [pdf, other]

    cs.IR

    Improving Sequential Recommenders through Counterfactual Augmentation of System Exposure

    Authors: Ziqi Zhao, Zhaochun Ren, Jiyuan Yang, Zuming Yan, Zihan Wang, Liu Yang, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Xin Xin

    Abstract: In sequential recommendation (SR), system exposure refers to items that are exposed to the user. Typically, only a few of the exposed items would be interacted with by the user. Although SR has achieved great success in predicting future user interests, existing SR methods still fail to fully exploit system exposure data. Most methods only model items that have been interacted with, while the larg…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)

  8. arXiv:2504.13092  [pdf, other]

    cs.CV

    EventVAD: Training-Free Event-Aware Video Anomaly Detection

    Authors: Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li

    Abstract: Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require an amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse…

    Submitted 17 April, 2025; originally announced April 2025.

  9. arXiv:2504.12314  [pdf, other]

    cs.CL cs.AI

    How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

    Authors: Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan

    Abstract: Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in th…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 17 pages

  10. arXiv:2504.11999  [pdf, other]

    cs.CV

    A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning

    Authors: Mengyu Wang, Hanbo Bi, Yingchao Feng, Linlin Xin, Shuo Gong, Tianqi Wang, Zhiyuan Yan, Peijin Wang, Wenhui Diao, Xian Sun

    Abstract: Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient i…

    Submitted 16 April, 2025; originally announced April 2025.

  11. arXiv:2504.09807  [pdf, other]

    physics.flu-dyn cs.LG

    Virtual domain extension for imposing boundary conditions in flow simulation using pre-trained local neural operator

    Authors: Ximeng Ye, Hongyu Li, Zhen-Guo Yan

    Abstract: This paper builds up a virtual domain extension (VDE) framework for imposing boundary conditions (BCs) in flow simulation using pre-trained local neural operator (LNO). It creates extended virtual domains to the input function to compensate for the corrosion nature of computational domains during LNO inference, thus turns the implementation of BC into the determination of field values on the exten…

    Submitted 13 April, 2025; originally announced April 2025.

  12. arXiv:2504.09588  [pdf, other]

    cs.CV cs.AI

    TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

    Authors: Zhicong Wu, Hongbin Xu, Gang Xu, Ping Nie, Zhixin Yan, Jinkai Zheng, Liangqiong Qu, Ming Li, Liqiang Nie

    Abstract: Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accuratel…

    Submitted 13 April, 2025; originally announced April 2025.

  13. arXiv:2504.06958  [pdf, other]

    cs.CV

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Authors: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang

    Abstract: Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of R…

    Submitted 13 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

  14. arXiv:2504.04827  [pdf, other]

    cs.CV cs.AI

    From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

    Authors: Long Ma, Zhiyuan Yan, Yize Chen, Jin Xu, Qinglang Guo, Hu Huang, Yong Liao, Hui Lin

    Abstract: Detecting deepfakes has been an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: How can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, col…

    Submitted 7 April, 2025; originally announced April 2025.

  15. arXiv:2504.02782  [pdf, other]

    cs.CV

    GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

    Authors: Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan

    Abstract: The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2)…

    Submitted 2 May, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

  16. arXiv:2504.02624  [pdf, other]

    cs.HC

    EmbodiedSense: Understanding Embodied Activities with Earphones

    Authors: Lixing He, Bufang Yang, Di Duan, Zhenyu Yan, Guoliang Xing

    Abstract: In this paper, we propose EmbodiedSense, a sensing system based on commercial earphones, which enables fine-grained activity logs using existing sensors. The activity logs record both user activities and the scenario in which the activities took place, benefiting detailed behavior understanding. By understanding both the user and the environment, EmbodiedSense addresses three main challenges: the…

    Submitted 3 April, 2025; originally announced April 2025.

  17. arXiv:2504.01979  [pdf, other]

    cs.SI cs.AI

    Correlation-Attention Masked Temporal Transformer for User Identity Linkage Using Heterogeneous Mobility Data

    Authors: Ziang Yan, Xingyu Zhao, Hanqing Ma, Wei Chen, Jianpeng Qi, Yanwei Yu, Junyu Dong

    Abstract: With the rise of social media and Location-Based Social Networks (LBSN), check-in data across platforms has become crucial for User Identity Linkage (UIL). These data not only reveal users' spatio-temporal information but also provide insights into their behavior patterns and interests. However, cross-platform identity linkage faces challenges like poor data quality, high sparsity, and noise inter…

    Submitted 27 March, 2025; originally announced April 2025.

    Comments: 9 pages, 5 figures, 3 tables

  18. arXiv:2504.01396  [pdf, other]

    cs.CV

    All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning

    Authors: Zheng Yang, Ruoxin Chen, Zhiyuan Yan, Ke-Yue Zhang, Xinghe Fu, Shuang Wu, Xiujun Shu, Taiping Yao, Junchi Yan, Shouhong Ding, Xi Li

    Abstract: The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis: (1) All Patches Matter: Unlike conventional image classification where discriminative features concentrate on object-centric regions, each patch in AIGIs inherently c…

    Submitted 2 April, 2025; originally announced April 2025.

  19. arXiv:2503.24065  [pdf]

    cs.CV cs.RO

    COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

    Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, Jing Liu

    Abstract: Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while…

    Submitted 31 March, 2025; originally announced March 2025.

  20. arXiv:2503.23365  [pdf, other]

    cs.CV cs.RO

    OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users

    Authors: Zhangcun Yan, Jianqing Li, Peng Hang, Jian Sun

    Abstract: With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing th…

    Submitted 30 March, 2025; originally announced March 2025.

  21. arXiv:2503.21766  [pdf, other]

    cs.CV cs.AI

    Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence

    Authors: Haolin Liu, Xiaohang Zhan, Zizheng Yan, Zhongjin Luo, Yuxin Wen, Xiaoguang Han

    Abstract: Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response,…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025. Homepage: https://haolinliu97.github.io/Stable-Score/

  22. arXiv:2503.21269  [pdf, other]

    cs.CV

    Delving Deep into Semantic Relation Distillation

    Authors: Zhaoyi Yan, Kangjun Liu, Qixiang Ye

    Abstract: Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relat…

    Submitted 27 March, 2025; originally announced March 2025.

  23. arXiv:2503.19349  [pdf, other]

    eess.SY cs.LG math.OC

    Optimal Parameter Adaptation for Safety-Critical Control via Safe Barrier Bayesian Optimization

    Authors: Shengbo Wang, Ke Li, Zheng Yan, Zhenyuan Guo, Song Zhu, Guanghui Wen, Shiping Wen

    Abstract: Safety is of paramount importance in control systems to avoid costly risks and catastrophic damages. The control barrier function (CBF) method, a promising solution for safety-critical control, poses a new challenge of enhancing control performance due to its direct modification of original control design and the introduction of uncalibrated parameters. In this work, we shed light on the crucial r…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Preprint manuscript, review only

  24. arXiv:2503.19332  [pdf, other]

    cs.CV

    Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting

    Authors: Zhiying Yan, Yiyuan Liang, Shilv Cai, Tao Zhang, Sheng Zhong, Luxin Yan, Xu Zou

    Abstract: Semantic 4D Gaussians can be used for reconstructing and understanding dynamic scenes, with temporal variations than static scenes. Directly applying static methods to understand dynamic scenes will fail to capture the temporal features. Few works focus on dynamic scene understanding based on Gaussian Splatting, since once the same update strategy is employed for both dynamic and static parts, reg…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: ICME 2025

  25. arXiv:2503.12927  [pdf, other]

    cs.CV cs.AI

    MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation

    Authors: Huangwei Chen, Yifei Chen, Zhenyu Yan, Mingyang Ding, Chenlei Li, Zhu Zhu, Feiwei Qin

    Abstract: Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates patholo…

    Submitted 19 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: 25 pages, 7 figures

  26. arXiv:2503.12199  [pdf, other]

    eess.SY cs.RO

    Formation Control of Multi-agent System with Local Interaction and Artificial Potential Field

    Authors: Luoyin Zhao, Zheping Yan, Yuqing Wang, Raye Chen-Hua Yeow

    Abstract: A novel local interaction control method (LICM) is proposed in this paper to realize the formation control of multi-agent system (MAS). A local interaction leader follower (LILF) structure is provided by coupling the advantages of information consensus and leader follower frame, the agents can obtain the state information of the leader by interacting with their neighbours, which will reduce the co…

    Submitted 15 March, 2025; originally announced March 2025.

  27. arXiv:2503.10567  [pdf, other]

    cs.LG

    FedPCA: Noise-Robust Fair Federated Learning via Performance-Capacity Analysis

    Authors: Nannan Wu, Zengqiang Yan, Nong Sang, Li Yu, Chang Wen Chen

    Abstract: Training a model that effectively handles both common and rare data (i.e., achieving performance fairness) is crucial in federated learning (FL). While existing fair FL methods have shown effectiveness, they remain vulnerable to mislabeled data. Ensuring robustness in fair FL is therefore essential. However, fairness and robustness inherently compete, which causes robust strategies to hinder fairnes…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Preprint

  28. arXiv:2503.10270  [pdf, other]

    cs.CV

    EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

    Authors: Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang

    Abstract: Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we rethink that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion pr…

    Submitted 30 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: 17 pages; fixed figure mistake (inv/fwd skipping) in Fig. 2

  29. arXiv:2503.10118  [pdf, other]

    cs.RO cs.LG

    An Real-Sim-Real (RSR) Loop Framework for Generalizable Robotic Policy Transfer with Differentiable Simulation

    Authors: Lu Shi, Yuxuan Xu, Shiyu Wang, Jinhao Huang, Wenhao Zhao, Yufei Jia, Zike Yan, Weibin Gu, Guyue Zhou

    Abstract: The sim-to-real gap remains a critical challenge in robotics, hindering the deployment of algorithms trained in simulation to real-world systems. This paper introduces a novel Real-Sim-Real (RSR) loop framework leveraging differentiable simulation to address this gap by iteratively refining simulation parameters, aligning them with real-world conditions, and enabling robust and efficient policy tr…

    Submitted 18 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

  30. arXiv:2503.09587  [pdf, other]

    eess.IV cs.CV cs.LG

    Fair Federated Medical Image Classification Against Quality Shift via Inter-Client Progressive State Matching

    Authors: Nannan Wu, Zhuo Kuang, Zengqiang Yan, Ping Wang, Li Yu

    Abstract: Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions, stemming from lower-quality data from a minority of clients, biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Preprint

  31. arXiv:2503.07417  [pdf, other]

    cs.CV

    GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts

    Authors: Minwen Liao, Hao Bo Dong, Xinyi Wang, Ziyang Yan, Yihua Shao

    Abstract: Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and so on, which can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to int…

    Submitted 26 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  32. arXiv:2503.07215  [pdf, other]

    cs.SE cs.PL

    Control Flow-Augmented Decompiler based on Large Language Model

    Authors: Peipei Liu, Jian Sun, Li Chen, Zhaoteng Yan, Peizheng Zhang, Dapeng Sun, Dawei Wang, Dan Li

    Abstract: Binary decompilation plays a crucial role in various tasks related to security threat analysis and software engineering, such as binary vulnerability detection and software supply chain analysis. Current prevalent binary decompilation methods primarily rely on large language models (LLMs) and can be broadly classified into two main approaches: prompt-based decompilation and end-to-end decompilation…

    Submitted 10 March, 2025; originally announced March 2025.

  33. arXiv:2503.06564  [pdf, other]

    cs.CV

    TR-DQ: Time-Rotation Diffusion Quantization

    Authors: Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang

    Abstract: Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to accoun…

    Submitted 9 March, 2025; originally announced March 2025.

  34. arXiv:2503.04557  [pdf, other]

    cs.RO

    Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations

    Authors: Hanyi Zhao, Jinxuan Zhu, Zihao Yan, Yichen Li, Yuhong Deng, Xueqian Wang

    Abstract: Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state spaces and the dynamics of cloth. Despite recent significant advances in end-to-end imitation learning for multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our insight in tackling the challenge of generalizable multi-step cloth manipulation is decomposition. We…

    Submitted 6 March, 2025; originally announced March 2025.

  35. arXiv:2503.04171  [pdf, other]

    cs.CV

    DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

    Authors: Zhiqiang Yan, Zhengxue Wang, Haoye Dong, Jun Li, Jian Yang, Gim Hee Lee

    Abstract: We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Corre…

    Submitted 6 March, 2025; originally announced March 2025.

  36. arXiv:2503.02330  [pdf, other]

    cs.CV

    Exploring Simple Siamese Network for High-Resolution Video Quality Assessment

    Authors: Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm

    Abstract: In the research of video quality assessment (VQA), two-branch network has emerged as a promising solution. It decouples VQA with separate technical and aesthetic branches to measure the perception of low-level distortions and high-level semantics respectively. However, we argue that while technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Accepted by ICASSP 2025

  37. arXiv:2502.21060  [pdf, other]

    cs.LG cs.IT

    Efficient Transformer-based Decoder for Varshamov-Tenengolts Codes

    Authors: Yali Wei, Alan J. X. Guo, Zihui Yan, Yufan Dai

    Abstract: In recent years, the rise of DNA data storage technology has brought significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have emerged as a central research focus. While existing decoding methods achieve high accuracy…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: 9 pages, 2 figures, 9 tables

  38. arXiv:2502.15768  [pdf]

    cs.LG cs.CV physics.optics

    Exploring the Role of Artificial Intelligence and Machine Learning in Process Optimization for Chemical Industry

    Authors: Zishuo Lin, Jiajie Wang, Zhe Yan, Peiyong Ma

    Abstract: The crucial field of Optical Chemical Structure Recognition (OCSR) aims to transform chemical structure photographs into machine-readable formats so that chemical databases may be efficiently stored and queried. Although a number of OCSR technologies have been created, little is known about how well they work in different picture deterioration scenarios. In this work, a new dataset of chemically s…

    Submitted 15 February, 2025; originally announced February 2025.

  39. arXiv:2502.13255  [pdf, other]

    cs.HC cs.CY cs.RO

    PCB Renewal: Iterative Reuse of PCB Substrates for Sustainable Electronic Making

    Authors: Zeyu Yan, Advait Vartak, Jiasheng Li, Zining Zhang, Huaishu Peng

    Abstract: PCB (printed circuit board) substrates are often single-use, leading to material waste in electronics making. We introduce PCB Renewal, a novel technique that "erases" and "reconfigures" PCB traces by selectively depositing conductive epoxy onto outdated areas, transforming isolated paths into conductive planes that support new traces. We present the PCB Renewal workflow, evaluate its electrical p…

    Submitted 18 February, 2025; originally announced February 2025.

    Journal ref: ACM CHI 2025

  40. Make Making Sustainable: Exploring Sustainability Practices, Challenges, and Opportunities in Making Activities

    Authors: Zeyu Yan, Mrunal Dhaygude, Huaishu Peng

    Abstract: The recent democratization of personal fabrication has significantly advanced the maker movement and reshaped applied research in HCI and beyond. However, this growth has also raised increasing sustainability concerns, as material waste is an inevitable byproduct of making and rapid prototyping. In this work, we examine the sustainability landscape within the modern maker community, focusing on gr…

    Submitted 18 February, 2025; originally announced February 2025.

    Journal ref: CHI 2025

  41. arXiv:2502.11094  [pdf, other]

    cs.SD cs.AI

    SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer

    Authors: Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Yexin Yang, Zhenhua Ling

    Abstract: This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models. SyncSpeech has the following advantages: Low latency, as it begins generating streaming speech upon receiving the second text token; High efficiency, a…

    Submitted 16 February, 2025; originally announced February 2025.

  42. arXiv:2502.09723  [pdf, other]

    cs.CR cs.AI cs.CL

    Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models

    Authors: Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang

    Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propo…

    Submitted 20 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: 15 pages, 11 figures

  43. arXiv:2502.07289  [pdf, other]

    cs.CV

    Learning Inverse Laplacian Pyramid for Progressive Depth Completion

    Authors: Kun Wang, Zhiqiang Yan, Junkai Fan, Jun Li, Jian Yang

    Abstract: Depth completion endeavors to reconstruct a dense depth map from sparse depth measurements, leveraging the information provided by a corresponding color image. Existing approaches mostly hinge on single-scale propagation strategies that iteratively ameliorate initial coarse depth estimates through pixel-level message passing. Despite their commendable outcomes, these techniques are frequently hamp…

    Submitted 11 February, 2025; originally announced February 2025.

  44. arXiv:2502.05695  [pdf, other]

    cs.MM cs.AI cs.CV cs.LG eess.IV

    Semantic-Aware Adaptive Video Streaming Using Latent Diffusion Models for Wireless Networks

    Authors: Zijiang Yan, Jianhua Pei, Hongda Wu, Hina Tabassum, Ping Wang

    Abstract: This paper proposes a novel framework for real-time adaptive-bitrate video streaming by integrating latent diffusion models (LDMs) within the FFmpeg techniques. This solution addresses the challenges of high bandwidth usage, storage inefficiencies, and quality of experience (QoE) degradation associated with traditional constant bitrate streaming (CBS) and adaptive bitrate streaming (ABS). The prop…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: Submission for possible publication

  45. arXiv:2501.17878  [pdf, other]

    eess.SP cs.LG

    Collaborative Channel Access and Transmission for NR Sidelink and Wi-Fi Coexistence over Unlicensed Spectrum

    Authors: Zhuangzhuang Yan, Xinyu Gu, Zhenyu Liu, Liyang Lu

    Abstract: With the rapid development of various internet of things (IoT) applications, including industrial IoT (IIoT) and visual IoT (VIoT), the demand for direct device-to-device communication to support high data rates continues to grow. To address this demand, 5G-Advanced has introduced sidelink communication over the unlicensed spectrum (SL-U) to increase data rates. However, the primary challenge of S…

    Submitted 14 February, 2025; v1 submitted 19 January, 2025; originally announced January 2025.

  46. arXiv:2501.17635  [pdf, other]

    cs.CL cs.AI cs.CV

    In-Context Meta LoRA Generation

    Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo

    Abstract: Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter g…

    Submitted 30 January, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

  47. arXiv:2501.15134  [pdf, other]

    cs.SE

    BitsAI-CR: Automated Code Review via LLM in Practice

    Authors: Tao Sun, Jian Xu, Yuanpeng Li, Zhao Yan, Ge Zhang, Lintao Xie, Lu Geng, Zheng Wang, Yueyan Chen, Qin Lin, Wenbo Duan, Kaixin Sui

    Abstract: Code review remains a critical yet resource-intensive process in software development, particularly challenging in large-scale industrial environments. While Large Language Models (LLMs) show promise for automating code review, existing solutions face significant limitations in precision and practicality. This paper presents BitsAI-CR, an innovative framework that enhances code review through a tw…

    Submitted 25 January, 2025; originally announced January 2025.

  48. arXiv:2501.14728  [pdf, other]

    cs.MM cs.CL cs.CV cs.CY

    Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection

    Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee

    Abstract: While large generative artificial intelligence (GenAI) models have achieved significant success, they also raise growing concerns about online information security due to their potential misuse for generating deceptive content. Out-of-context (OOC) multimodal misinformation detection, which often retrieves Web evidence to identify the repurposing of images in false contexts, faces the issue of rea…

    Submitted 24 January, 2025; originally announced January 2025.

    Comments: 12 pages, 11 figures

  49. arXiv:2501.12948  [pdf, other]

    cs.CL cs.AI cs.LG

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, et al. (175 additional authors not shown)

    Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters…

    Submitted 22 January, 2025; originally announced January 2025.

  50. arXiv:2501.12386  [pdf, other]

    cs.CV

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Authors: Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang

    Abstract: This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations i…

    Submitted 22 January, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

    Comments: Technical report