Showing 1–50 of 1,022 results for author: Cheng, Y

Searching in archive cs.
  1. 3D-Fixup: Advancing Photo Editing with 3D Priors

    Authors: Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon, Alex Schwing, Liangyan Gui, Matheus Gadelha, Paul Guerrero, Nanxuan Zhao

    Abstract: Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To ac… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: SIGGRAPH 2025. Project page: https://3dfixup.github.io/

  2. arXiv:2505.10202  [pdf, other]

    cs.CL

    VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

    Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, YiMing Cheng, ZhiYu Wu, You Shan, MingKai Zheng

    Abstract: Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

  3. arXiv:2505.08617  [pdf, ps, other]

    cs.CV

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Authors: Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng

    Abstract: While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effective… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Work in progress

  4. arXiv:2505.08084  [pdf, other]

    cs.CV

    Visually Interpretable Subtask Reasoning for Visual Question Answering

    Authors: Yu Cheng, Arushi Goel, Hakan Bilen

    Abstract: Answering complex visual questions like `Which red furniture can be used for sitting?' requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to p… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

  5. arXiv:2505.07375  [pdf, ps, other]

    cs.CV

    Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection

    Authors: Yuqi Cheng, Yunkang Cao, Dongfang Wang, Weiming Shen, Wenlong Li

    Abstract: Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different class data leads to the featur… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 12 pages, 12 figures

  6. arXiv:2505.07203  [pdf, other]

    cs.DC

    PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

    Authors: Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, Junchen Jiang

    Abstract: Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labeling. The key characteristic of these emerging use cases is that the LLM generates only a single output token, rather than an arbitrarily long sequence of tokens… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

  7. arXiv:2505.06588  [pdf, ps, other]

    cs.RO cs.MA eess.SY nlin.AO

    Emergent Multi-View Fidelity in Autonomous UAV Swarm Sport Injury Detection

    Authors: Yu Cheng, Harun Šiljak

    Abstract: Accurate, real-time collision detection is essential for ensuring player safety and effective refereeing in high-contact sports such as rugby, particularly given the severe risks associated with traumatic brain injuries (TBI). Traditional collision-monitoring methods employing fixed cameras or wearable sensors face limitations in visibility, coverage, and responsiveness. Previously, we introduced… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: Accepted for 2025 8th International Balkan Conference on Communications and Networking (Balkancom)

  8. arXiv:2505.06252  [pdf, other]

    cs.DB cs.DC

    Towards Efficient LLM Storage Reduction via Tensor Deduplication and Delta Compression

    Authors: Zirui Wang, Tingfeng Lan, Zhaoyuan Su, Juncheng Yang, Yue Cheng

    Abstract: Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all pu… ▽ More

    Submitted 30 April, 2025; originally announced May 2025.

  9. arXiv:2505.05190  [pdf, other]

    cs.LG cs.AI cs.CL cs.CR

    Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

    Authors: Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal

    Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-ent… ▽ More

    Submitted 11 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: ICML 2025 Accepted

  10. arXiv:2505.04996  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

    Authors: Jinhe Huang, Yongkang Cheng, Yuming Hang, Gaoge Han, Jinewei Li, Jing Zhang, Xingjian Gu

    Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Gener… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Accepted by ICMR 2025

  11. arXiv:2505.04653  [pdf, ps, other]

    cs.CL cs.AI cs.CV cs.LG

    Advancing Conversational Diagnostic AI with Multimodal Reasoning

    Authors: Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G. T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias, et al. (11 additional authors not shown)

    Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the abil… ▽ More

    Submitted 6 May, 2025; originally announced May 2025.

  12. arXiv:2505.04116  [pdf, other]

    cs.MM

    RFNNS: Robust Fixed Neural Network Steganography with Popular Deep Generative Models

    Authors: Yu Cheng, Jiuan Zhou, Jiawei Chen, Zhaoxia Yin, Xinpeng Zhang

    Abstract: Image steganography is a technique that conceals secret information in a cover image to achieve covert communication. Recent research has demonstrated that Fixed Neural Network Steganography (FNNS) exhibits significant practical advantages, as it enables stable and efficient steganographic embedding and extraction without requiring neural network training. However, the stego image generated by exi… ▽ More

    Submitted 7 May, 2025; originally announced May 2025.

  13. arXiv:2504.17577  [pdf, other]

    cs.LG

    TileLang: A Composable Tiled Programming Model for AI Systems

    Authors: Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang

    Abstract: Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, har… ▽ More

    Submitted 27 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

  14. arXiv:2504.16915  [pdf, other]

    cs.CV

    DreamO: A Unified Framework for Image Customization

    Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu

    Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of condition. Developing a unified framework for image customization remains an open challenge.… ▽ More

    Submitted 13 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  15. arXiv:2504.16320  [pdf, other]

    cs.RO cs.LG

    PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp

    Authors: Yaofeng Cheng, Fusheng Zha, Wei Guo, Pengfei Wang, Chao Zeng, Lining Sun, Chenguang Yang

    Abstract: The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown significant potential in enabling robots to grasp target objects. However, most existing methods are based on the point clouds (2.5D points) generated from single-view depth images. These point clouds only have one surface side of the object providing incomplete geometry information, which mislead the grasping algorithm to… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

  16. arXiv:2504.15721  [pdf, other]

    cs.AR

    BBAL: A Bidirectional Block Floating Point-Based Quantisation Accelerator for Large Language Models

    Authors: Xiaomeng Han, Yuan Cheng, Jing Wang, Junyang Lu, Hui Wang, X. x. Zhang, Ning Xu, Dawei Yang, Zhe Jiang

    Abstract: Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block Floating Point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating point operations into low-bit fixed point operations. However, BFP requires aligning all data to… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

  17. arXiv:2504.15223  [pdf]

    cs.LG

    A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention

    Authors: Tao Yang, Yu Cheng, Yaokun Ren, Yujia Lou, Minggu Wei, Honghui Xin

    Abstract: This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a multi-scale attention mechanism. The BiLSTM captures both forward and backward dependencies in sequences, enhancing the model's ability to perceive global cont… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

  18. arXiv:2504.15046  [pdf, other]

    cs.AI

    Text-to-Decision Agent: Learning Generalist Policies from Natural Language Supervision

    Authors: Shilin Zhang, Zican Hu, Wenhao Wu, Xinyi Xie, Jianxiang Tang, Chunlin Chen, Daoyi Dong, Yu Cheng, Zhenhong Sun, Zhi Wang

    Abstract: RL systems usually tackle generalization by inferring task beliefs from high-quality samples or warmup explorations. The restricted form limits their generality and usability since these supervision signals are expensive and even infeasible to acquire in advance for unseen tasks. Learning directly from the raw text about decision tasks is a promising alternative to leverage a much broader source o… ▽ More

    Submitted 22 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: 18 pages, 8 figures

  19. arXiv:2504.14948  [pdf, ps, other]

    cs.GT

    Mechanism Design for Auctions with Externalities on Budgets

    Authors: Yusen Zheng, Yukun Cheng, Chenyang Xu, Xiaotie Deng

    Abstract: This paper studies mechanism design for auctions with externalities on budgets, a novel setting where the budgets that bidders commit are adjusted due to the externality of the competitors' allocation outcomes-a departure from traditional auctions with fixed budgets. This setting is motivated by real-world scenarios, for example, participants may increase their budgets in response to competitors'… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

  20. arXiv:2504.14946  [pdf, other]

    cs.LG

    Symmetry-Preserving Architecture for Multi-NUMA Environments (SPANE): A Deep Reinforcement Learning Approach for Dynamic VM Scheduling

    Authors: Tin Ping Chan, Yunlong Cheng, Yizhan Zhu, Xiaofeng Gao, Guihai Chen

    Abstract: As cloud computing continues to evolve, the adoption of multi-NUMA (Non-Uniform Memory Access) architecture by cloud service providers has introduced new challenges in virtual machine (VM) scheduling. To address these challenges and more accurately reflect the complexities faced by modern cloud environments, we introduce the Dynamic VM Allocation problem in Multi-NUMA PM (DVAMP). We formally defin… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 10 pages, 7 figures. Accepted to IEEE INFOCOM 2025

  21. arXiv:2504.14945  [pdf, other]

    cs.LG cs.AI cs.CL

    Learning to Reason under Off-Policy Guidance

    Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang

    Abstract: Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities.… ▽ More

    Submitted 22 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: Work in progress

  22. arXiv:2504.13961  [pdf, other]

    cs.LG cs.AI stat.ML

    CONTINA: Confidence Interval for Traffic Demand Prediction with Coverage Guarantee

    Authors: Chao Yang, Xiannan Huang, Shuhan Qiu, Yan Cheng

    Abstract: Accurate short-term traffic demand prediction is critical for the operation of traffic systems. Besides point estimation, the confidence interval of the prediction is also of great importance. Many models for traffic operations, such as shared bike rebalancing and taxi dispatching, take into account the uncertainty of future demand and require confidence intervals as the input. However, existing m… ▽ More

    Submitted 16 April, 2025; originally announced April 2025.

  23. arXiv:2504.12856  [pdf, other]

    cs.GR cs.AI cs.CV cs.LG cs.RO

    3D-PNAS: 3D Industrial Surface Anomaly Synthesis with Perlin Noise

    Authors: Yifeng Cheng, Juan Du

    Abstract: Large pretrained vision foundation models have shown significant potential in various vision tasks. However, for industrial anomaly detection, the scarcity of real defect samples poses a critical challenge in leveraging these models. While 2D anomaly generation has significantly advanced with established generative models, the adoption of 3D sensors in industrial manufacturing has made leveraging… ▽ More

    Submitted 17 April, 2025; originally announced April 2025.

    ACM Class: I.5.4

  24. arXiv:2504.12395  [pdf, other]

    cs.CV

    InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

    Authors: Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu

    Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character cus… ▽ More

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: Tech Report. Code is available at https://github.com/Tencent/InstantCharacter

  25. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  26. arXiv:2504.09993  [pdf, other]

    cs.LG

    AimTS: Augmented Series and Image Contrastive Learning for Time Series Classification

    Authors: Yuxuan Chen, Shanshan Huang, Yunyao Cheng, Peng Chen, Zhongwen Rao, Yang Shu, Bin Yang, Lujia Pan, Chenjuan Guo

    Abstract: Time series classification (TSC) is an important task in time series analysis. Existing TSC methods mainly train on each single domain separately, suffering from a degradation in accuracy when the samples for training are insufficient in certain domains. The pre-training and fine-tuning paradigm provides a promising direction for solving this problem. However, time series from different domains ar… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

  27. arXiv:2504.08252  [pdf, other]

    cs.CV

    Stereophotoclinometry Revisited

    Authors: Travis Driver, Andrew Vaughan, Yang Cheng, Adnan Ansar, John Christian, Panagiotis Tsiotras

    Abstract: Image-based surface reconstruction and characterization is crucial for missions to small celestial bodies, as it informs mission planning, navigation, and scientific analysis. However, current state-of-the-practice methods, such as stereophotoclinometry (SPC), rely heavily on human-in-the-loop verification and high-fidelity a priori information. This paper proposes Photoclinometry-from-Motion (Pho… ▽ More

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2312.06865

  28. arXiv:2504.07406  [pdf, other]

    cs.SD eess.AS

    Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio

    Authors: Yu-Hua Chen, Yuan-Chiao Cheng, Yen-Tung Yeh, Jui-Te Wu, Jyh-Shing Roger Jang, Yi-Hsuan Yang

    Abstract: Transcribing electric guitar recordings is challenging due to the scarcity of diverse datasets and the complex tone-related variations introduced by amplifiers, cabinets, and effect pedals. To address these issues, we introduce EGDB-PG, a novel dataset designed to capture a wide range of tone-related characteristics across various amplifier-cabinet configurations. In addition, we propose the Tone-… ▽ More

    Submitted 9 April, 2025; originally announced April 2025.

  29. arXiv:2504.06780  [pdf, ps, other]

    cs.IR

    CHIME: A Compressive Framework for Holistic Interest Modeling

    Authors: Yong Bai, Rui Xiang, Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai

    Abstract: Modeling holistic user interests is important for improving recommendation systems but is challenged by high computational cost and difficulty in handling diverse information with full behavior context. Existing search-based methods might lose critical signals during behavior selection. To overcome these limitations, we propose CHIME: A Compressive Framework for Holistic Interest Modeling. It uses… ▽ More

    Submitted 9 April, 2025; originally announced April 2025.

  30. arXiv:2504.06664  [pdf, other]

    cs.CL cs.LG

    SEE: Continual Fine-tuning with Sequential Ensemble of Experts

    Authors: Zhilin Wang, Yafu Li, Xiaoye Qu, Yu Cheng

    Abstract: Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data. Nevertheless, they still suffer inevitable performance loss. Although training separate experts for each task can help prevent forgetting, effectively assembling them remains a challenge. Some approaches use routers to assig… ▽ More

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: 9 pages

  31. arXiv:2504.06636  [pdf, other]

    cs.IR

    BBQRec: Behavior-Bind Quantization for Multi-Modal Sequential Recommendation

    Authors: Kaiyuan Li, Rui Xiang, Yong Bai, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai

    Abstract: Multi-modal sequential recommendation systems leverage auxiliary signals (e.g., text, images) to alleviate data sparsity in user-item interactions. While recent methods exploit large language models to encode modalities into discrete semantic IDs for autoregressive prediction, we identify two critical limitations: (1) Existing approaches adopt fragmented quantization, where modalities are independ… ▽ More

    Submitted 9 April, 2025; originally announced April 2025.

  32. arXiv:2504.06544  [pdf, ps, other]

    cs.CV

    LCGC: Learning from Consistency Gradient Conflicting for Class-Imbalanced Semi-Supervised Debiasing

    Authors: Weiwei Xing, Yue Cheng, Hongzhu Yi, Xiaohui Gao, Xiang Wei, Xiaoyu Guo, Yuming Zhang, Xinyu Pang

    Abstract: Classifiers often learn to be biased corresponding to the class-imbalanced dataset, especially under the semi-supervised learning (SSL) set. While previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image's logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the bla… ▽ More

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted by AAAI 2025

  33. arXiv:2504.06437  [pdf, other]

    cs.RO eess.SY

    DBaS-Log-MPPI: Efficient and Safe Trajectory Optimization via Barrier States

    Authors: Fanxin Wang, Haolong Jiang, Chuyuan Tao, Wenbin Wan, Yikun Cheng

    Abstract: Optimizing trajectory costs for nonlinear control systems remains a significant challenge. Model Predictive Control (MPC), particularly sampling-based approaches such as the Model Predictive Path Integral (MPPI) method, has recently demonstrated considerable success by leveraging parallel computing to efficiently evaluate numerous trajectories. However, MPPI often struggles to balance safe navigat… ▽ More

    Submitted 26 March, 2025; originally announced April 2025.

    Comments: IROS 2025

  34. arXiv:2504.06358  [pdf, other]

    cs.CV

    Towards Calibration Enhanced Network by Inverse Adversarial Attack

    Authors: Yupeng Cheng, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Yushi Cao, Shang-Wei Lin

    Abstract: Test automation has become increasingly important as the complexity of both design and content in Human Machine Interface (HMI) software continues to grow. Current standard practice uses Optical Character Recognition (OCR) techniques to automatically extract textual information from HMI screens for validation. At present, one of the key challenges faced during the automation of HMI screen validati… ▽ More

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 11 pages

  35. arXiv:2504.03909  [pdf, other]

    cs.CR cs.DC cs.ET

    Secure Federated XGBoost with CUDA-accelerated Homomorphic Encryption via NVIDIA FLARE

    Authors: Ziyue Xu, Yuan-Ting Hsieh, Zhihong Zhang, Holger R. Roth, Chester Chen, Yan Cheng, Andrew Feng

    Abstract: Federated learning (FL) enables collaborative model training across decentralized datasets. NVIDIA FLARE's Federated XGBoost extends the popular XGBoost algorithm to both vertical and horizontal federated settings, facilitating joint model development without direct data sharing. However, the initial implementation assumed mutual trust over the sharing of intermediate gradient statistics produced… ▽ More

    Submitted 4 April, 2025; originally announced April 2025.

  36. arXiv:2504.03010  [pdf, other]

    cs.CV cs.LG

    Emotion Recognition Using Convolutional Neural Networks

    Authors: Shaoyuan Xu, Yang Cheng, Qian Lin, Jan P. Allebach

    Abstract: Emotion has an important role in daily life, as it helps people better communicate with and understand each other more efficiently. Facial expressions can be classified into 7 categories: angry, disgust, fear, happy, neutral, sad and surprise. How to detect and recognize these seven emotions has become a popular topic in the past decade. In this paper, we develop an emotion recognition system that… ▽ More

    Submitted 3 April, 2025; originally announced April 2025.

  37. arXiv:2504.02921  [pdf, other]

    cs.CL

    HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

    Authors: Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang

    Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While re… ▽ More

    Submitted 3 April, 2025; originally announced April 2025.

  38. arXiv:2504.02263  [pdf, other]

    cs.DC cs.LG

    MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

    Authors: Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu

    Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present Meg… ▽ More

    Submitted 23 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

  39. arXiv:2504.02160  [pdf, other]

    cs.CV cs.LG

    Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

    Authors: Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He

    Abstract: Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, ma… ▽ More

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Project page: https://bytedance.github.io/UNO; code and model: https://github.com/bytedance/UNO

  40. arXiv:2504.01990  [pdf, other]

    cs.AI

    Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

    Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, et al. (22 additional authors not shown)

    Abstract: The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate… ▽ More

    Submitted 31 March, 2025; originally announced April 2025.

  41. arXiv:2504.01234  [pdf]

    cs.MA physics.optics

    First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution

    Authors: Yihao Zhang, Qizhi Qiu, Xiaomin Liu, Dianxuan Fu, Xingyu Liu, Leyan Fei, Yuming Cheng, Lilin Yi, Weisheng Hu, Qunbi Zhuge

    Abstract: We demonstrate the first cross-domain cross-layer level-4 autonomous optical network via a multi-AI-agent system. Field trials show 98 percent task completion rate across the distributed AI training lifecycle-3.2x higher than single agents using state-of-the-art LLMs.

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Submitted to the PDP session of the Optical Fiber Communications Conference (OFC) 2025

  42. arXiv:2503.24272  [pdf, other]

    cs.CV cs.LG

    Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

    Authors: Yizhou Huang, Yihua Cheng, Kezhi Wang

    Abstract: Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where ground-truth labels are directly optimized against predicted trajectories. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supe…

    Submitted 31 March, 2025; originally announced March 2025.

  43. arXiv:2503.24067  [pdf, other]

    cs.LG

    TransMamba: Flexibly Switching between Transformer and Mamba

    Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang

    Abstract: Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Preprint. Under review

  44. arXiv:2503.22757  [pdf, other]

    cs.RO eess.SY nlin.AO

    Strategies for decentralised UAV-based collisions monitoring in rugby

    Authors: Yu Cheng, Harun Šiljak

    Abstract: Recent advancements in unmanned aerial vehicle (UAV) technology have opened new avenues for dynamic data collection in challenging environments, such as sports fields during fast-paced sports action. For the purposes of monitoring sport events for dangerous injuries, we envision a coordinated UAV fleet designed to capture high-quality, multi-view video footage of collision events in real-time. The…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Submitted for publication in an IEEE publication

  45. arXiv:2503.21614  [pdf, other]

    cs.CL

    A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

    Authors: Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng

    Abstract: Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems,…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Survey, 32 pages, Large Reasoning Models, Efficient Reasoning for Language, Multimodality, and Beyond

  46. arXiv:2503.21401  [pdf, other]

    cs.RO cs.LG eess.SY

    AcL: Action Learner for Fault-Tolerant Quadruped Locomotion Control

    Authors: Tianyu Xu, Yaoyu Cheng, Pinxi Shen, Lin Zhao

    Abstract: Quadrupedal robots can learn versatile locomotion skills but remain vulnerable when one or more joints lose power. In contrast, dogs and cats can adopt limping gaits when injured, demonstrating their remarkable ability to adapt to physical conditions. Inspired by such adaptability, this paper presents Action Learner (AcL), a novel teacher-student reinforcement learning framework that enables quadr…

    Submitted 28 March, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

  47. arXiv:2503.20591   

    cs.DC

    NotebookOS: A Notebook Operating System for Interactive Training with On-Demand GPUs

    Authors: Benjamin Carver, Jingyuan Zhang, Haoliang Wang, Kanak Mahadik, Yue Cheng

    Abstract: Interactive notebook programming is universal in modern ML (machine learning) and AI (artificial intelligence) workflows. Notebook software like Jupyter and Google Colab provides a user-friendly, interactive, web-based programming interface and is widely used across science and engineering domains. A dominant application of production notebook workloads is interactive deep learning training (IDLT)…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: arXiv admin note: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

    ACM Class: C.2.4

  48. arXiv:2503.20265  [pdf, other]

    cs.SE

    Fixseeker: An Empirical Driven Graph-based Approach for Detecting Silent Vulnerability Fixes in Open Source Software

    Authors: Yiran Cheng, Ting Zhang, Lwin Khin Shar, Zhe Lang, David Lo, Shichao Lv, Dongliang Fang, Zhiqiang Shi, Limin Sun

    Abstract: Open source software vulnerabilities pose significant security risks to downstream applications. While vulnerability databases provide valuable information for mitigation, many security patches are released silently in new commits of OSS repositories without explicit indications of their security impact. This makes it challenging for software maintainers and users to detect and address these vulne…

    Submitted 26 March, 2025; originally announced March 2025.

  49. arXiv:2503.19839  [pdf, other]

    cs.CV

    FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

    Authors: Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, Xiaodan Liang

    Abstract: Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-b…

    Submitted 29 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  50. arXiv:2503.19404  [pdf, other]

    cs.CV

    LangBridge: Interpreting Image as a Combination of Language Embeddings

    Authors: Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng

    Abstract: Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While t…

    Submitted 25 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: The code and weights will be open-sourced. Project page: https://jiaqiliao77.github.io/LangBridge.github.io/