
Showing 1–50 of 52,310 results for author: Chen

Searching in archive cs.

  1. arXiv:2507.02860  [pdf, ps, other]

    cs.CV

    Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

    Authors: Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai

    Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: The code is made available at https://github.com/H-EmbodVis/EasyCache. Project page: https://h-embodvis.github.io/EasyCache/

  2. arXiv:2507.02842  [pdf, ps, other]

    cs.DS

    On the Structure of Replicable Hypothesis Testers

    Authors: Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Shyam Narayanan, Sandeep Silwal

    Abstract: A hypothesis testing algorithm is replicable if, when run on two different samples from the same distribution, it produces the same output with high probability. This notion, defined by Impagliazzo, Lei, Pitassi, and Sorell [STOC'22], can increase trust in testing procedures and is deeply related to algorithmic stability, generalization, and privacy. We build general tools to prove lower and up…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Abstract abridged to meet arxiv requirements

  3. arXiv:2507.02824  [pdf, ps, other]

    eess.SP cs.AI cs.LG

    DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift

    Authors: Po-Heng Chou, Ching-Wen Chen, Wan-Jen Huang, Walid Saad, Yu Tsao, Ronald Y. Chang

    Abstract: In this paper, the precoding design is investigated for maximizing the throughput of millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems with obstructed direct communication paths. In particular, a reconfigurable intelligent surface (RIS) is employed to enhance MIMO transmissions, considering mmWave characteristics related to line-of-sight (LoS) and multipath effects. The tradit…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 5 pages, 4 figures, 2 tables, accepted by IEEE Globecom 2024 Workshops

  4. arXiv:2507.02790  [pdf, ps, other]

    cs.CV cs.CL

    From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

    Authors: Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu

    Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to inco…

    Submitted 3 July, 2025; originally announced July 2025.

  5. arXiv:2507.02768  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  6. arXiv:2507.02754  [pdf, ps, other]

    cs.LG cs.AI

    Fast and Simplex: 2-Simplicial Attention in Triton

    Authors: Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil

    Abstract: Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets,…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 10 pages, with appendix 25 pages

  7. arXiv:2507.02752  [pdf]

    physics.chem-ph cs.AI

    Synthesizable by Design: A Retrosynthesis-Guided Framework for Molecular Analog Generation

    Authors: Shuan Chen, Gunwook Nam, Yousung Jung

    Abstract: The disconnect between AI-generated molecules with desirable properties and their synthetic feasibility remains a critical bottleneck in computational drug and material discovery. While generative AI has accelerated the proposal of candidate molecules, many of these structures prove challenging or impossible to synthesize using established chemical reactions. Here, we introduce SynTwins, a novel r…

    Submitted 3 July, 2025; originally announced July 2025.

  8. arXiv:2507.02747  [pdf, ps, other]

    cs.CV cs.RO

    DexVLG: Dexterous Vision-Language-Grasp Model at Scale

    Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang

    Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Visi…

    Submitted 3 July, 2025; originally announced July 2025.

  9. arXiv:2507.02735  [pdf, ps, other]

    cs.CR cs.AI

    Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

    Authors: Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo

    Abstract: Prompt injection attacks pose a significant security threat to LLM-integrated applications. Model-level defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mit…

    Submitted 3 July, 2025; originally announced July 2025.

  10. arXiv:2507.02659  [pdf, ps, other]

    cs.LG cs.CL

    OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

    Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

    Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and ti…

    Submitted 3 July, 2025; originally announced July 2025.

  11. arXiv:2507.02626  [pdf, ps, other]

    cs.MM

    VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

    Authors: Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

    Abstract: Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content unders…

    Submitted 3 July, 2025; originally announced July 2025.

  12. arXiv:2507.02606  [pdf, ps, other]

    cs.SD cs.AI cs.CR cs.LG eess.AS

    De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

    Authors: Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

    Abstract: The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by ICML 2025

  13. arXiv:2507.02600  [pdf, ps, other]

    cs.RO

    ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects

    Authors: Qiaojun Yu, Xibin Yuan, Yu Jiang, Junting Chen, Dongzhe Zheng, Ce Hao, Yang You, Yixing Chen, Yao Mu, Liu Liu, Cewu Lu

    Abstract: Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconst…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by IROS 2025

  14. arXiv:2507.02598  [pdf, ps, other]

    cs.AR cs.AI

    AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models

    Authors: Chenhao Xue, Kezhi Li, Jiaxing Zhang, Yi Ren, Zhengyuan Shi, Chen Zhang, Yibo Lin, Lining Zhang, Qiang Xu, Guangyu Sun

    Abstract: Arithmetic circuits, such as adders and multipliers, are fundamental components of digital systems, directly impacting the performance, power efficiency, and area footprint. However, optimizing these circuits remains challenging due to the vast design space and complex physical constraints. While recent deep learning-based approaches have shown promise, they struggle to consistently explore high-p…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 8 pages, 12 figures

  15. arXiv:2507.02581  [pdf, ps, other]

    cs.CV

    Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

    Authors: Tan Pan, Zhaorui Tan, Kaiyu Guo, Dongli Xu, Weidi Xu, Chen Jiang, Xin Guo, Yuan Qi, Yuan Cheng

    Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In th…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV25

  16. arXiv:2507.02575  [pdf, ps, other]

    cond-mat.soft cond-mat.stat-mech cs.MA nlin.AO

    A unifying approach to self-organizing systems interacting via conservation laws

    Authors: Frank Barrows, Guanming Zhang, Satyam Anand, Zizi Chen, Jonathan Lin, Amman Desai, Stefano Martiniani, Francesco Caravelli

    Abstract: We present a unified framework for embedding and analyzing dynamical systems using generalized projection operators rooted in local conservation laws. By representing physical, biological, and engineered systems as graphs with incidence and cycle matrices, we derive dual projection operators that decompose network fluxes and potentials. This formalism aligns with principles of non-equilibrium ther…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 19 pages single column + 24 pages supplementary

  17. arXiv:2507.02565  [pdf, ps, other]

    cs.CV

    Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

    Authors: Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen, Yangang Wang, Gim Hee Lee

    Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacle…

    Submitted 3 July, 2025; originally announced July 2025.

  18. arXiv:2507.02547  [pdf, ps, other]

    cs.RO

    Vibration of Soft, Twisted Beams for Under-Actuated Quadrupedal Locomotion

    Authors: Yuhao Jiang, Fuchen Chen, Jamie Paik, Daniel M. Aukes

    Abstract: Under-actuated compliant robotic systems offer a promising approach to mitigating actuation and control challenges by harnessing pre-designed, embodied dynamic behaviors. This paper presents Flix-Walker, a novel, untethered, centimeter-scale quadrupedal robot inspired by compliant under-actuated mechanisms. Flix-Walker employs flexible, helix-shaped beams as legs, which are actuated by vibrations…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: This manuscript is under revision for possible publication in the IEEE/ASME Transactions on Mechatronics. Copyright may be transferred to IEEE if the manuscript is accepted for publication, without further notice. Supplementary videos: https://youtu.be/T3d6FT3Rx-s, https://youtu.be/nPQrhKlN02E

  19. arXiv:2507.02479  [pdf, ps, other]

    cs.CV cs.AI

    CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios

    Authors: Teng Fu, Yuwen Chen, Zhuofan Chen, Mengyang Zhao, Bin Li, Xiangyang Xue

    Abstract: Multi-object tracking is a classic field in computer vision. Within it, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motio…

    Submitted 3 July, 2025; originally announced July 2025.

  20. arXiv:2507.02454  [pdf, ps, other]

    cs.CV

    Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

    Authors: Weiwei Duan, Luping Ji, Shengjia Chen, Sicheng Zhu, Jianghong Huang, Mao Ye

    Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared f…

    Submitted 3 July, 2025; originally announced July 2025.

  21. arXiv:2507.02437  [pdf, ps, other]

    cs.CV eess.IV

    F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

    Authors: Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu

    Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in ran…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: This paper has been submitted to relevant journals

  22. arXiv:2507.02411  [pdf, ps, other]

    eess.IV cs.CV

    3D Heart Reconstruction from Sparse Pose-agnostic 2D Echocardiographic Slices

    Authors: Zhurong Chen, Jinhua Chen, Wei Zhuo, Wufeng Xue, Dong Ni

    Abstract: Echocardiography (echo) plays an indispensable role in the clinical practice of heart diseases. However, ultrasound imaging typically provides only two-dimensional (2D) cross-sectional images from a few specific views, making it challenging to interpret and inaccurate for estimation of clinical parameters like the volume of left ventricle (LV). 3D ultrasound imaging provides an alternative for 3D…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 10 pages

  23. arXiv:2507.02399  [pdf, ps, other]

    cs.CV cs.LG

    TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation

    Authors: Peilin Zhang, Shaouxan Wua, Jun Feng, Zhuo Jin, Zhizezhang Gao, Jingkun Chen, Yaqiong Xing, Xiao Zhang

    Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the…

    Submitted 3 July, 2025; originally announced July 2025.

    Journal ref: Computer Methods and Programs in Biomedicine 2025

  24. arXiv:2507.02381  [pdf, ps, other]

    cs.NE

    Running-time Analysis of ($μ+λ$) Evolutionary Combinatorial Optimization Based on Multiple-gain Estimation

    Authors: Min Huang, Pengxiang Chen, Han Huang, Tongli He, Yushan Zhang, Zhifeng Hao

    Abstract: The running-time analysis of evolutionary combinatorial optimization is a fundamental topic in evolutionary computation. However, theoretical results regarding the $(μ+λ)$ evolutionary algorithm (EA) for combinatorial optimization problems remain relatively scarce compared to those for simple pseudo-Boolean problems. This paper proposes a multiple-gain model to analyze the running time of EAs for…

    Submitted 3 July, 2025; originally announced July 2025.

  25. arXiv:2507.02379  [pdf]

    cs.AI q-bio.BM

    An AI-native experimental laboratory for autonomous biomolecular engineering

    Authors: Mingyu Wu, Zhaoguo Wang, Jiabin Wang, Zhiyuan Dong, Jingkai Yang, Qingting Li, Tianyu Huang, Lei Zhao, Mingqiang Li, Fei Wang, Chunhai Fan, Haibo Chen

    Abstract: Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflo…

    Submitted 3 July, 2025; originally announced July 2025.

  26. arXiv:2507.02353  [pdf, ps, other]

    cs.AI

    OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

    Authors: Bowen Chen, Zhao Wang, Shingo Takamatsu

    Abstract: Keyword decision in Sponsored Search Advertising is critical to the success of ad campaigns. While LLM-based methods offer automated keyword generation, they face three major limitations: reliance on large-scale query-keyword pair data, lack of online multi-objective performance monitoring and optimization, and weak quality control in keyword selection. These issues hinder the agentic use of LLMs…

    Submitted 3 July, 2025; originally announced July 2025.

  27. arXiv:2507.02318  [pdf, ps, other]

    cs.SE

    Precisely Detecting Python Type Errors via LLM-based Unit Test Generation

    Authors: Chen Yang, Ziqi Wang, Yanjie Jiang, Lin Yang, Yuteng Zheng, Jianyi Zhou, Junjie Chen

    Abstract: Type errors in Python often lead to runtime failures, posing significant challenges to software reliability and developer productivity. Existing static analysis tools aim to detect such errors without execution but frequently suffer from high false positive rates. Recently, unit test generation techniques offer great promise in achieving high test coverage, but they often struggle to produce bug-r…

    Submitted 3 July, 2025; originally announced July 2025.

  28. arXiv:2507.02316  [pdf, ps, other]

    cs.CV

    Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

    Authors: Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo

    Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval m…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 7 pages, 10 figures

  29. arXiv:2507.02299  [pdf, ps, other]

    cs.CV

    DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

    Authors: Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Edmund Y. Lam, Hengshuang Zhao, Tong He, Xihui Liu

    Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diff…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by TPAMI, extension of CVPR 2024 paper DreamComposer

  30. arXiv:2507.02279  [pdf, ps, other]

    cs.CV

    LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

    Authors: Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng

    Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two cor…

    Submitted 2 July, 2025; originally announced July 2025.

  31. arXiv:2507.02270  [pdf, ps, other]

    cs.CV

    MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

    Authors: Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang, Wangyu Wu, Xuhang Chen

    Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color ac…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE SMC 2025

  32. arXiv:2507.02259  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Authors: Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou

    Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Project Page: https://memagent-sialab.github.io/

  33. arXiv:2507.02255  [pdf, ps, other]

    cs.IR cs.LG

    Listwise Preference Alignment Optimization for Tail Item Recommendation

    Authors: Zihao Li, Chao Yang, Tong Zhang, Yakun Chen, Xianzhi Wang, Guandong Xu, Daoyi Dong

    Abstract: Preference alignment has achieved great success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former substantially increases computational costs, while the latter hinders training efficiency on negative…

    Submitted 2 July, 2025; originally announced July 2025.

  34. arXiv:2507.02252  [pdf, ps, other]

    cs.CV cs.AI

    SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement

    Authors: Zeyu Lei, Hongyuan Yu, Jinlin Wu, Zhen Chen

    Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intel…

    Submitted 2 July, 2025; originally announced July 2025.

  35. arXiv:2507.02250  [pdf, ps, other]

    cs.CV

    FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

    Authors: Jiangxia Chen, Tongyuan Huang, Ke Song

    Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which requires additional data and significant computational resources. To address these issues, this paper proposes F…

    Submitted 2 July, 2025; originally announced July 2025.

  36. arXiv:2507.02245  [pdf, ps, other]

    cs.RO

    CoInfra: A Large-Scale Cooperative Infrastructure Perception System and Dataset in Adverse Weather

    Authors: Minghao Ning, Yufeng Yang, Keqi Shu, Shucheng Huang, Jiaming Zhong, Maryam Salehi, Mahdi Rahmani, Yukun Lu, Chen Sun, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour

    Abstract: We present CoInfra, a large-scale cooperative infrastructure perception system and dataset designed to advance robust multi-agent perception under real-world and adverse weather conditions. The CoInfra system includes 14 fully synchronized sensor nodes, each equipped with dual RGB cameras and a LiDAR, deployed across a shared region and operating continuously to capture all traffic participants in…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: This paper has been submitted to the IEEE Transactions on Robotics for review

  37. arXiv:2507.02200  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

    Authors: Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang

    Abstract: Event stream based scene text recognition is a newly emerging research topic that performs better than widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works adopt either an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by the challenges of in…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: A Strong Baseline for Reasoning based Event Stream Scene Text Recognition

  38. arXiv:2507.02128  [pdf, ps, other]

    cs.LG

    CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs

    Authors: Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen

    Abstract: Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection r…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCAD 2025

  39. arXiv:2507.02057  [pdf, ps, other]

    cs.CR cs.AI

    MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation

    Authors: Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang

    Abstract: Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software development, raising significant security concerns. While LLM providers have implemented alignment mechanisms to prevent direct generation of overtly malicious code, these safeguards predominantly evaluate individ…

    Submitted 2 July, 2025; originally announced July 2025.

  40. arXiv:2507.02029  [pdf, ps, other]

    cs.RO

    RoboBrain 2.0 Technical Report

    Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Shanyu Rong, Zhengliang Cai, Bolun Zhang, Shuyi Zhang, Huaihai Lyu, Mengfei Du, et al. (21 additional authors not shown)

    Abstract: We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain…

    Submitted 2 July, 2025; originally announced July 2025.

  41. arXiv:2507.01990  [pdf, ps, other]

    q-fin.GN cs.AI cs.LG

    Integrating Large Language Models in Financial Investments and Market Analysis: A Survey

    Authors: Sedigheh Mahdavi, Jiating, Chen, Pradeep Kumar Joshi, Lina Huertas Guativa, Upmanyu Singh

    Abstract: Large Language Models (LLMs) have been employed in financial decision making, enhancing analytical capabilities for investment strategies. Traditional investment strategies often utilize quantitative models, fundamental analysis, and technical indicators. However, LLMs have introduced new capabilities to process and analyze large volumes of structured and unstructured data, extract meaningful insi…

    Submitted 29 June, 2025; originally announced July 2025.

  42. arXiv:2507.01961  [pdf, ps, other]

    cs.RO cs.AI

    AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

    Authors: Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

    Abstract: Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumu…

    Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: Project website: https://ac-dit.github.io/

  43. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  44. arXiv:2507.01945  [pdf, ps, other]

    cs.CV

    LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

    Authors: Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao

    Abstract: Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions…

    Submitted 2 July, 2025; originally announced July 2025.

  45. arXiv:2507.01925  [pdf, ps, other]

    cs.RO

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

    Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 70 pages, 5 figures

  46. arXiv:2507.01923  [pdf, ps, other]

    cs.CL

    Decision-Oriented Text Evaluation

    Authors: Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen

    Abstract: Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using…

    Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

  47. arXiv:2507.01908  [pdf, ps, other]

    cs.CV

    Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

    Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and…

    Submitted 2 July, 2025; originally announced July 2025.

  48. arXiv:2507.01903  [pdf, ps, other]

    cs.CL cs.AI

    AI4Research: A Survey of Artificial Intelligence for Scientific Research

    Authors: Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che

    Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Preprint

  49. arXiv:2507.01873  [pdf, ps, other]

    cs.DS

    Breaking the $n^{1.5}$ Additive Error Barrier for Private and Efficient Graph Sparsification via Private Expander Decomposition

    Authors: Anders Aamand, Justin Y. Chen, Mina Dalirrooyfard, Slobodan Mitrović, Yuriy Nevmyvaka, Sandeep Silwal, Yinzhan Xu

    Abstract: We study differentially private algorithms for graph cut sparsification, a fundamental problem in algorithms, privacy, and machine learning. While significant progress has been made, the best-known private and efficient cut sparsifiers on $n$-node graphs approximate each cut within $\widetilde{O}(n^{1.5})$ additive error and $1+γ$ multiplicative error for any $γ > 0$ [Gupta, Roth, Ullman TCC'12]. I…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ICML 2025

  50. arXiv:2507.01791  [pdf]

    cs.CV

    Boosting Adversarial Transferability Against Defenses via Multi-Scale Transformation

    Authors: Zihong Guo, Chen Wan, Yayin Zheng, Hailing Kuang, Xiaohai Lu

    Abstract: The transferability of adversarial examples poses a significant security challenge for deep neural networks, which can be attacked without knowing anything about them. In this paper, we propose a new Segmented Gaussian Pyramid (SGP) attack method to enhance the transferability, particularly against defense models. Unlike existing methods that generally focus on single-scale images, our approach em…

    Submitted 2 July, 2025; originally announced July 2025.