Showing 1–50 of 1,940 results for author: chen, T

Searching in archive cs.
  1. arXiv:2506.15160  [pdf, ps, other]

    cs.CV

    Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

    Authors: Jiaqi Shi, Jin Xiao, Xiaoguang Hu, Boyang Song, Hao Jiang, Tianyou Chen, Baochang Zhang

    Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, there are irrelevant point interference and feature hierarchy gap problems due to the limitation of local coordinates. Although some works address this limita…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 17 pages, 7 figures

  2. arXiv:2506.14323  [pdf, ps, other]

    cs.CR cs.NI

    Vulnerability Disclosure or Notification? Best Practices for Reaching Stakeholders at Scale

    Authors: Ting-Han Chen, Jeroen van der Ham-de Vos

    Abstract: Security researchers are interested in security vulnerabilities, but these security vulnerabilities create risks for stakeholders. Coordinated Vulnerability Disclosure has been an accepted best practice for many years in disclosing newly discovered vulnerabilities. This practice has mostly worked, but it can become challenging when there are many different parties involved. There has also been r…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 18 pages, 1 figure

  3. arXiv:2506.14229  [pdf, ps, other]

    cs.CV cs.AI

    HRGS: Hierarchical Gaussian Splatting for Memory-Efficient High-Resolution 3D Reconstruction

    Authors: Changbai Li, Haodong Zhu, Hanlin Chen, Juan Zhang, Tongfei Chen, Shuo Yang, Shuwei Shao, Wenhao Dong, Baochang Zhang

    Abstract: 3D Gaussian Splatting (3DGS) has made significant strides in real-time 3D scene reconstruction, but faces memory scalability issues in high-resolution scenarios. To address this, we propose Hierarchical Gaussian Splatting (HRGS), a memory-efficient framework with hierarchical block-level optimization. First, we generate a global, coarse Gaussian representation from low-resolution data. Then, we pa…

    Submitted 17 June, 2025; originally announced June 2025.

  4. arXiv:2506.14009  [pdf, ps, other]

    cs.RO

    GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

    Authors: Qianzhong Chen, Naixiang Gao, Suning Huang, JunEn Low, Timothy Chen, Jiankai Sun, Mac Schwager

    Abstract: Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework…

    Submitted 16 June, 2025; originally announced June 2025.

  5. arXiv:2506.12210  [pdf, ps, other]

    cs.ET cs.LG

    Machine Intelligence on Wireless Edge Networks

    Authors: Sri Krishna Vadlamani, Kfir Sulimany, Zhihui Gao, Tingjun Chen, Dirk Englund

    Abstract: Deep neural network (DNN) inference on power-constrained edge devices is bottlenecked by costly weight storage and data movement. We introduce MIWEN, a radio-frequency (RF) analog architecture that "disaggregates" memory by streaming weights wirelessly and performing classification in the analog front end of standard transceivers. By encoding weights and activations onto RF carriers and using na…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 13 pages, 6 figures

  6. arXiv:2506.11170  [pdf, ps, other]

    cs.LG cs.AI

    PromptTSS: A Prompting-Based Approach for Interactive Multi-Granularity Time Series Segmentation

    Authors: Ching Chang, Ming-Chih Lo, Wen-Chih Peng, Tien-Fu Chen

    Abstract: Multivariate time series data, collected across various fields such as manufacturing and wearable technology, exhibit states at multiple levels of granularity, from coarse-grained system behaviors to fine-grained, detailed events. Effectively segmenting and integrating states across these different granularities is crucial for tasks like predictive maintenance and performance optimization. However…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: This paper is currently under review. The code will be made available upon acceptance

  7. arXiv:2506.10412  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

    Authors: Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang

    Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM,…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: This paper is currently under review

  8. arXiv:2506.10389  [pdf, other]

    cs.LG

    EQA-RM: A Generative Embodied Reward Model with Test-time Scaling

    Authors: Yuhang Chen, Zhen Tan, Tianlong Chen

    Abstract: Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Co…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: preprint

  9. arXiv:2506.10030  [pdf, ps, other]

    cs.CR cs.AI

    Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment

    Authors: Tianyu Chen, Jian Lou, Wenjie Wang

    Abstract: As Retrieval-Augmented Generation (RAG) evolves into service-oriented platforms (RAG-as-a-Service) with shared knowledge bases, protecting the copyright of contributed data becomes essential. Existing watermarking methods in RAG focus solely on textual knowledge, leaving image knowledge unprotected. In this work, we propose AQUA, the first watermark framework for image knowledge protection in Mult…

    Submitted 10 June, 2025; originally announced June 2025.

  10. arXiv:2506.09991  [pdf, ps, other]

    cs.LG

    Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

    Authors: Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen

    Abstract: Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel su…

    Submitted 13 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  11. arXiv:2506.09043  [pdf]

    cs.PL

    Gradual Metaprogramming

    Authors: Tianyu Chen, Darshal Shetty, Jeremy G. Siek, Chao-Hong Chen, Weixi Ma, Arnaud Venet, Rocky Liu

    Abstract: Data engineers increasingly use domain-specific languages (DSLs) to generate the code for data pipelines. Such DSLs are often embedded in Python. Unfortunately, there are challenges in debugging the generation of data pipelines: an error in a Python DSL script is often detected too late, after the execution of the script, and the source code location that triggers the error is hard to pinpoint.…

    Submitted 16 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 13 pages, 10 figures

    ACM Class: D.3

  12. arXiv:2506.09018  [pdf, ps, other]

    cs.LG cs.AI

    Edit Flows: Flow Matching with Edit Operations

    Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen

    Abstract: Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations: insertions, deletions, and substitutions. By modeling these operations within a Cont…

    Submitted 10 June, 2025; originally announced June 2025.

  13. arXiv:2506.08418  [pdf, ps, other]

    cs.CV eess.SP

    RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation

    Authors: Taiqin Chen, Zikun Zhou, Zheng Fang, Wenzhen Zou, Kanjun Liu, Ke Chen, Yongbing Zhang, Yaowei Wang

    Abstract: The radio map represents the spatial distribution of spectrum resources within a region, supporting efficient resource allocation and interference mitigation. However, it is difficult to construct a dense radio map as a limited number of samples can be measured in practical scenarios. While existing works have used deep learning to estimate dense radio maps from sparse samples, they are hard to in…

    Submitted 9 June, 2025; originally announced June 2025.

  14. arXiv:2506.08417  [pdf, other]

    cs.LG cs.AI

    Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood

    Authors: Qingmao Yao, Zhichao Lei, Tianyuan Chen, Ziyue Yuan, Xuefan Chen, Jianxiang Liu, Faguo Wu, Xiao Zhang

    Abstract: Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to the $Q$-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains the $Q$-function generalization. This over-constraint issue results in poor $Q$-value estimat…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: ICLR 2025

  15. arXiv:2506.07811  [pdf, ps, other]

    cs.CV

    Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

    Authors: Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin

    Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant perfo…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Preprint

  16. arXiv:2506.07581  [pdf, ps, other]

    cs.LG cs.AI cs.DC

    FedCGD: Collective Gradient Divergence Optimized Scheduling for Wireless Federated Learning

    Authors: Tan Chen, Jintao Yan, Yuxuan Sun, Sheng Zhou, Zhisheng Niu

    Abstract: Federated learning (FL) is a promising paradigm for multiple devices to cooperatively train a model. When applied in wireless networks, two issues consistently affect the performance of FL, i.e., data heterogeneity of devices and limited bandwidth. Many papers have investigated device scheduling strategies considering the two issues. However, most of them recognize data heterogeneity as a property…

    Submitted 9 June, 2025; originally announced June 2025.

  17. arXiv:2506.07328  [pdf, ps, other]

    cs.LG

    Mobility-Aware Asynchronous Federated Learning with Dynamic Sparsification

    Authors: Jintao Yan, Tan Chen, Yuxuan Sun, Zhaojun Nan, Sheng Zhou, Zhisheng Niu

    Abstract: Asynchronous Federated Learning (AFL) enables distributed model training across multiple mobile devices, allowing each device to independently update its local model without waiting for others. However, device mobility introduces intermittent connectivity, which necessitates gradient sparsification and leads to model staleness, jointly affecting AFL convergence. This paper develops a theoretical m…

    Submitted 8 June, 2025; originally announced June 2025.

  18. arXiv:2506.06276  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

    Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai

    Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlo…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: TLDR: We show for the first time that normalizing flows can be scaled for high-resolution and text-conditioned image synthesis

  19. arXiv:2506.05316  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

    Authors: Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang

    Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targe…

    Submitted 5 June, 2025; originally announced June 2025.

  20. arXiv:2506.05302  [pdf, ps, other]

    cs.CV

    Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

    Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li

    Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categor…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 19 pages, 13 figures, Website: https://Perceive-Anything.github.io

  21. arXiv:2506.05216  [pdf, ps, other]

    cs.LG cs.DS quant-ph

    A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values

    Authors: Tyler Chen, Akshay Seshadri, Mattia J. Villani, Pradeep Niroula, Shouvanik Chakrabarti, Archan Ray, Pranav Deshpande, Romina Yalovetzky, Marco Pistoia, Niraj Kumar

    Abstract: Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being t…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 44 pages, 7 figures, 7 tables

  22. arXiv:2506.03085  [pdf, ps, other]

    cs.LG

    Non-Asymptotic Length Generalization

    Authors: Thomas Chen, Tengyu Ma, Zhiyuan Li

    Abstract: Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound f…

    Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  23. arXiv:2506.03070  [pdf, ps, other]

    cs.DS cs.DC math.NA

    GPU-Parallelizable Randomized Sketch-and-Precondition for Linear Regression using Sparse Sign Sketches

    Authors: Tyler Chen, Pradeep Niroula, Archan Ray, Pragna Subrahmanya, Marco Pistoia, Niraj Kumar

    Abstract: A litany of theoretical and numerical results have established the sketch-and-precondition paradigm as a powerful approach to solving large linear regression problems in standard computing environments. Perhaps surprisingly, much less work has been done on understanding how sketch-and-precondition performs on graphics processing unit (GPU) systems. We address this gap by benchmarking an implementa…

    Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  24. arXiv:2506.03065  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

    Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen

    Abstract: While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and ver…

    Submitted 3 June, 2025; originally announced June 2025.

  25. arXiv:2506.02875  [pdf, ps, other]

    cs.CV

    NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results

    Authors: Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, Guangtao Zhai, Shushi Wang, Yingjie Zhou, Lu Liu, Jingxin Li, Liu Yang, Farong Wen, Li Xu, Yanwei Jiang, Xilei Zhu, Chunyi Li, Zicheng Zhang, Huiyu Duan, Xiele Wu, Yixuan Gao, Yuqin Cao, Jun Jia, Wei Sun, Jiezhang Cao, Radu Timofte, et al. (70 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. This challenge is to address a major challenge in the field of video and talking head processing. The challenge is divided into three tracks, including user generated video, AI generated video and talking he…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: NTIRE 2025 XGC Quality Assessment Challenge Report. arXiv admin note: text overlap with arXiv:2404.16687

  26. arXiv:2506.02210  [pdf, ps, other]

    cs.LG cs.AI cs.PF

    Exchangeability in Neural Network Architectures and its Application to Dynamic Pruning

    Authors: Pu, Yi, Tianlang Chen, Yifan Yang, Sara Achour

    Abstract: Neural networks (NNs) are equipped with increasingly many parameters and require more and more resources for deployment. Researchers have explored various ways to improve the efficiency of NNs by identifying and reducing the redundancy, such as pruning or quantizing unimportant weights. Symmetry in the NN architectures has been identified by prior work as a possible type of redundancy, but exploiti…

    Submitted 2 June, 2025; originally announced June 2025.

  27. arXiv:2506.02040  [pdf, ps, other]

    cs.CR cs.SE

    Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

    Authors: Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, Jiachi Chen

    Abstract: The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have already been developed and deployed. However, the client-server integration architecture inherent in MCP may expand the attack surface against LLM Agent systems, introd…

    Submitted 5 June, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

  28. arXiv:2506.01918  [pdf, ps, other]

    cs.CL

    Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis

    Authors: Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, Tianlong Chen

    Abstract: Image mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial info…

    Submitted 2 June, 2025; originally announced June 2025.

  29. arXiv:2506.01710  [pdf, other]

    cs.CL

    Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning

    Authors: Fangyu Lei, Jinxiang Meng, Yiming Huang, Tinghong Chen, Yun Zhang, Shizhu He, Jun Zhao, Kang Liu

    Abstract: Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imi…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Work in progress

  30. arXiv:2506.00290  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    DLM-One: Diffusion Language Models for One-Step Sequence Generation

    Authors: Tianqi Chen, Shujian Zhang, Mingyuan Zhou

    Abstract: This paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model's outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substant…

    Submitted 30 May, 2025; originally announced June 2025.

  31. arXiv:2506.00273  [pdf, other]

    eess.AS cs.LG cs.SD

    SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction

    Authors: Tuochao Chen, D Shin, Hakan Erdogan, Sinan Hersek

    Abstract: This paper introduces SoundSculpt, a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information (e.g., target direction obtained by pointing at an immersive video) and semantic embeddings (e.g., derived from image segmentation and captioning). Trained and evaluated o…

    Submitted 30 May, 2025; originally announced June 2025.

  32. arXiv:2505.23024  [pdf, ps, other]

    cs.LG

    An Empirical Study of Federated Prompt Learning for Vision Language Model

    Authors: Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye

    Abstract: The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning…

    Submitted 28 May, 2025; originally announced May 2025.

  33. arXiv:2505.23013  [pdf, other]

    cs.LG

    Scalable Complexity Control Facilitates Reasoning Ability of LLMs

    Authors: Liangkai Hang, Junjie Yao, Zhiwei Bai, Tianyi Chen, Yang Chen, Rongjie Diao, Hezhou Li, Pengxiao Lin, Zhiwei Wang, Cheng Xu, Zhongwang Zhang, Zhangchen Zhou, Zhiyu Li, Zehao Lin, Kai Chen, Feiyu Xiong, Yaoyu Zhang, Weinan E, Hongkang Yang, Zhi-Qin John Xu

    Abstract: The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over va…

    Submitted 28 May, 2025; originally announced May 2025.

  34. arXiv:2505.22490  [pdf, ps, other]

    cs.CV

    ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

    Authors: Ke Zhang, Tianyu Ding, Jiachen Jiang, Tianyi Chen, Ilya Zharkov, Vishal M. Patel, Luming Liang

    Abstract: Image cropping is crucial for enhancing the visual appeal and narrative impact of photographs, yet existing rule-based and data-driven approaches often lack diversity or require annotated training data. We introduce ProCrop, a retrieval-based method that leverages professional photography to guide cropping decisions. By fusing features from professional photographs with those of the query image, P…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 16 pages, 15 figures

  35. arXiv:2505.22101  [pdf, other]

    cs.CL

    MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

    Authors: Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, Feiyu Xiong

    Abstract: Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation…

    Submitted 28 May, 2025; originally announced May 2025.

  36. arXiv:2505.21996  [pdf, ps, other]

    cs.CV cs.AI

    Learning World Models for Interactive Video Generation

    Authors: Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

    Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additi…

    Submitted 28 May, 2025; originally announced May 2025.

  37. arXiv:2505.21591  [pdf, other]

    cs.LG cs.AI

    Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

    Authors: Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen

    Abstract: Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization i…

    Submitted 27 May, 2025; originally announced May 2025.

  38. arXiv:2505.21327  [pdf, other]

    cs.AI cs.CV

    MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

    Authors: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue

    Abstract: Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To addre…

    Submitted 27 May, 2025; originally announced May 2025.

  39. arXiv:2505.21200  [pdf, ps, other]

    cs.CV

    Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models

    Authors: Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, Tao Chen

    Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, w…

    Submitted 27 May, 2025; originally announced May 2025.

  40. arXiv:2505.20914  [pdf, other]

    cs.CV

    Geometry-Editable and Appearance-Preserving Object Compositon

    Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen

    Abstract: General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-l…

    Submitted 27 May, 2025; originally announced May 2025.

  41. arXiv:2505.20687  [pdf, other]

    cs.CV

    VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

    Authors: Mingxuan Sun, Juntao Jiang, Zhiqiang Yang, Shenao Kong, Jiamin Qi, Jianru Shang, Shuangling Luo, Wanfa Sun, Tianyi Wang, Yanqi Wang, Qixuan Wang, Tingjian Dai, Tianxiang Chen, Jinming Zhang, Xuerui Zhang, Yuepeng He, Pengcheng Fu, Qiu Guan, Shizheng Zhou, Yanbo Yu, Qigui Jiang, Teng Zhou, Liuyong Shi, Hong Yan

    Abstract: Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring…

    Submitted 26 May, 2025; originally announced May 2025.

  42. How Do Experts Make Sense of Integrated Process Models?

    Authors: Tianwa Chen, Barbara Weber, Graeme Shanks, Gianluca Demartini, Marta Indulska, Shazia Sadiq

    Abstract: A range of integrated modeling approaches have been developed to enable a holistic representation of business process logic together with all relevant business rules. These approaches address inherent problems with separate documentation of business process models and business rules. In this study, we explore how expert process workers make sense of the information provided through such integrated…

    Submitted 28 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  43. arXiv:2505.20640  [pdf, ps, other]

    cs.CV

    IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

    Authors: Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, Yu Kong

    Abstract: Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabil…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: v1.0

  44. arXiv:2505.20299  [pdf, ps, other]

    physics.optics cs.AI

    MetamatBench: Integrating Heterogeneous Data, Computational Tools, and Visual Interface for Metamaterial Discovery

    Authors: Jianpeng Chen, Wangzhi Zhan, Haohui Wang, Zian Jia, Jingru Gan, Junkai Zhang, Jingyuan Qi, Tingwei Chen, Lifu Huang, Muhao Chen, Ling Li, Wei Wang, Dawei Zhou

    Abstract: Metamaterials, engineered materials with architected structures across multiple length scales, offer unprecedented and tunable mechanical properties that surpass those of conventional materials. However, leveraging advanced machine learning (ML) for metamaterial discovery is hindered by three fundamental challenges: (C1) Data Heterogeneity Challenge arises from heterogeneous data sources, heteroge…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 15 pages

    ACM Class: I.2.0; H.5; J.2; E.0

  45. arXiv:2505.20279  [pdf, ps, other]

    cs.CV cs.CL

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan

    Abstract: The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth…

    Submitted 1 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: Project Page: https://vlm-3r.github.io/

  46. arXiv:2505.19504  [pdf, other]

    cs.LG cs.AI cs.CL

    DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

    Authors: Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen

    Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing pr…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Code is available at https://github.com/UNITES-Lab/DOGe

  47. arXiv:2505.19502  [pdf, ps, other]

    cs.SE cs.AI

    CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

    Authors: Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen

    Abstract: Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and th…

    Submitted 26 May, 2025; originally announced May 2025.

  48. arXiv:2505.19427  [pdf, ps, other]

    cs.LG cs.AI

    WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

    Authors: Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen

    Abstract: The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design.…

    Submitted 25 May, 2025; originally announced May 2025.

  49. arXiv:2505.19190  [pdf, other]

    cs.LG cs.AI cs.CV

    I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts

    Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long

    Abstract: Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interact…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: ICML 2025 Poster

  50. arXiv:2505.18999  [pdf, ps, other]

    cs.IR

    Lightweight Embeddings with Graph Rewiring for Collaborative Filtering

    Authors: Xurong Liang, Tong Chen, Wei Yuan, Hongzhi Yin

    Abstract: As recommendation services scale rapidly and their deployment now commonly involves resource-constrained edge devices, GNN-based recommender systems face significant challenges, including high embedding storage costs and runtime latency from graph propagations. Our previous work, LEGCF, effectively reduced embedding storage costs but struggled to maintain recommendation performance under stricter…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: Accepted by TOIS'25