Showing 1–50 of 802 results for author: Yuan, J

Searching in archive cs.
  1. arXiv:2505.08350  [pdf, ps, other]

    cs.CV cs.AI

    STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives

    Authors: Bo Wang, Haoyang Huang, Zhiyin Lu, Fengyuan Liu, Guoqing Ma, Jianlong Yuan, Yuan Zhang, Nan Duan

    Abstract: This paper introduces StoryAnchors, a unified framework for generating high-quality, multi-scene story frames with strong temporal consistency. The framework employs a bidirectional story generator that integrates both past and future contexts to ensure temporal consistency, character continuity, and smooth scene transitions throughout the narrative. Specific conditions are introduced to distingui…

    Submitted 13 May, 2025; originally announced May 2025.

  2. arXiv:2505.07344  [pdf, other]

    cs.CV cs.AI

    Generative Pre-trained Autoregressive Diffusion Transformer

    Authors: Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan

    Abstract: In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semanti…

    Submitted 15 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2505.06335  [pdf, ps, other]

    cs.LG cs.AI cs.CR

    Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients

    Authors: Jinsheng Yuan, Yuhang Hao, Weisi Guo, Yun Wu, Chongyan Gu

    Abstract: Federated Learning (FL) has the potential for simultaneous global learning amongst a large number of parallel agents, enabling emerging AI such as LLMs to be trained across demographically diverse data. Central to this being efficient is the ability for FL to perform sparse gradient updates and remote direct memory access at the central server. Most of the research in FL security focuses on protec…

    Submitted 9 May, 2025; originally announced May 2025.

  4. DeepSTA: A Spatial-Temporal Attention Network for Logistics Delivery Timely Rate Prediction in Anomaly Conditions

    Authors: Jinhui Yi, Huan Yan, Haotian Wang, Jian Yuan, Yong Li

    Abstract: Prediction of couriers' delivery timely rates in advance is essential to the logistics industry, enabling companies to take preemptive measures to ensure the normal operation of delivery services. This becomes even more critical during anomaly conditions like the epidemic outbreak, during which couriers' delivery timely rate will decline markedly and fluctuate significantly. Existing studies pay…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: Accepted by CIKM 2023

  5. Learning to Estimate Package Delivery Time in Mixed Imbalanced Delivery and Pickup Logistics Services

    Authors: Jinhui Yi, Huan Yan, Haotian Wang, Jian Yuan, Yong Li

    Abstract: Accurately estimating package delivery time is essential to the logistics industry, which enables reasonable work allocation and on-time service guarantee. This becomes even more necessary in mixed logistics scenarios where couriers handle a high volume of deliveries and a smaller number of pickups simultaneously. However, most of the related works treat the pickup and delivery patterns on couriers'…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: Accepted by ACM SIGSPATIAL 2024

  6. arXiv:2504.16408  [pdf, other]

    cs.CL

    LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

    Authors: Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di

    Abstract: The LLMSR@XLLM25 formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrie…

    Submitted 13 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

    Comments: XLLM @ ACL 2025 Shared Task-III: LLM for Structural Reasoning (LLM-SR)

  7. arXiv:2504.15681  [pdf, other]

    cs.CV

    Vidi: Large Multimodal Models for Video Understanding and Editing

    Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu

    Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components…

    Submitted 24 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  8. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  9. arXiv:2504.13443  [pdf, other]

    cs.AI cs.DC cs.MA econ.GN

    Trust, but verify

    Authors: Michael J. Yuan, Carlos Campoy, Sydney Lai, James Snewin, Ju Long

    Abstract: Decentralized AI agent networks, such as Gaia, allow individuals to run customized LLMs on their own computers and then provide services to the public. However, in order to maintain service quality, the network must verify that individual nodes are running their designated LLMs. In this paper, we demonstrate that in a cluster of mostly honest nodes, we can detect nodes that run unauthorized or in…

    Submitted 17 April, 2025; originally announced April 2025.

  10. arXiv:2504.09655  [pdf]

    eess.IV cs.CV

    OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation

    Authors: Justin Namuk Kim, Yiqiao Liu, Rajath Soans, Keith Persson, Sarah Halek, Michal Tomaszewski, Jianda Yuan, Gregory Goldmacher, Antong Chen

    Abstract: Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block…

    Submitted 24 April, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

    Comments: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2025

  11. arXiv:2504.09479  [pdf, other]

    cs.AI cs.CL

    Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

    Authors: Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, Zhenglong Ding

    Abstract: Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams.…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 26 pages, 14 figures

  12. arXiv:2504.07398  [pdf, other]

    cs.IR cs.AI

    A Novel Mamba-based Sequential Recommendation Method

    Authors: Jun Yuan

    Abstract: Sequential recommendation (SR), which encodes user activity to predict the next action, has emerged as a widely adopted strategy in developing commercial personalized recommendation systems. Although Transformer-based models have proven effective for sequential recommendation, the complexity of the self-attention module in Transformers scales quadratically with the sequence length. Controlling mod…

    Submitted 9 April, 2025; originally announced April 2025.

  13. arXiv:2504.07089  [pdf, other]

    cs.CV cs.CL

    OmniCaptioner: One Captioner to Rule Them All

    Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao

    Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g.…

    Submitted 27 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

    Comments: More visualizations on Homepage: https://alpha-innovator.github.io/OmniCaptioner-project-page and Official code: https://github.com/Alpha-Innovator/OmniCaptioner

  14. arXiv:2503.23768  [pdf, other]

    cs.CL cs.CV

    Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

    Authors: Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang

    Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or bra…

    Submitted 31 March, 2025; originally announced March 2025.

  15. arXiv:2503.23747  [pdf, other]

    cs.CV

    Consistency-aware Self-Training for Iterative-based Stereo Matching

    Authors: Jingyi Zhou, Peng Ye, Haoyu Zhang, Jiakang Yuan, Rao Qiang, Liu YangChenXu, Wu Cailin, Feng Xu, Tao Chen

    Abstract: Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  16. arXiv:2503.21758  [pdf, other]

    cs.CV

    Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

    Authors: Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao

    Abstract: We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task ex…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Tech Report, 21 pages, 12 figures

  17. arXiv:2503.21460  [pdf, other]

    cs.CL

    Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    Authors: Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, et al. (1 additional author not shown)

    Abstract: The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architec…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: 329 papers surveyed, resources are at https://github.com/luo-junyu/Awesome-Agent-Papers

  18. arXiv:2503.21036  [pdf, other]

    cs.AI

    The Art of Tool Interface Design

    Authors: Yunnan Wu, Paul Chen, Deshank Baranwal, Jinlong Zhou, Jian Yuan

    Abstract: We present an agentic framework, Thinker, which achieves state-of-the-art performance in challenging reasoning tasks for realistic customer service scenarios that involve complex business logic and human interactions via long horizons. On the $\tau$-bench retail dataset, Thinker achieves 82.6\% success rate with GPT-4o (version 2024-06-01) (baseline: 68.3\%), and 81.9\% success rate with Llama-3.1 405B (…

    Submitted 26 March, 2025; originally announced March 2025.

  19. arXiv:2503.20505  [pdf, other]

    cs.LG stat.ML

    Riemannian Optimization on Relaxed Indicator Matrix Manifold

    Authors: Jinghui Yuan, Fangyuan Xie, Feiping Nie, Xuelong Li

    Abstract: The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide…

    Submitted 11 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  20. VibE: A Visual Analytics Workflow for Semantic Error Analysis of CVML Models at Subgroup Level

    Authors: Jun Yuan, Kevin Miao, Heyin Oh, Isaac Walker, Zhouyang Xue, Tigran Katolikyan, Marco Cavallo

    Abstract: Effective error analysis is critical for the successful development and deployment of CVML models. One approach to understanding model errors is to summarize the common characteristics of error samples. This can be particularly challenging in tasks that utilize unstructured, complex data such as images, where patterns are not always obvious. Another method is to analyze error distributions across…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 19 pages, 9 figures. Accepted by IUI'25

  21. arXiv:2503.19881  [pdf, other]

    cs.CV

    Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

    Authors: Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang

    Abstract: Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and t…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  22. arXiv:2503.19703  [pdf, other]

    cs.CV eess.IV

    High-Quality Spatial Reconstruction and Orthoimage Generation Using Efficient 2D Gaussian Splatting

    Authors: Qian Wang, Zhihao Zhan, Jialei He, Zhituo Tu, Xiang Zhu, Jie Yuan

    Abstract: Highly accurate geometric precision and dense image features characterize True Digital Orthophoto Maps (TDOMs), which are in great demand for applications such as urban planning, infrastructure management, and environmental monitoring. Traditional TDOM generation methods need sophisticated processes, such as Digital Surface Models (DSM) and occlusion detection, which are computationally expensive a…

    Submitted 13 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

  23. arXiv:2503.19358  [pdf, other]

    cs.CV

    From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting

    Authors: Zhiwei Huang, Hailin Yu, Yichun Shentu, Jin Yuan, Guofeng Zhang

    Abstract: This paper presents a novel camera relocalization method, STDLoc, which leverages Feature Gaussian as scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localizati…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 15 pages, 12 figures, CVPR 2025

  24. arXiv:2503.16419  [pdf, other]

    cs.CL

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Authors: Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) r…

    Submitted 23 April, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Project Website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs

  25. arXiv:2503.15569  [pdf, other]

    cs.LG cs.HC

    RAG-based User Profiling for Precision Planning in Mixed-precision Over-the-Air Federated Learning

    Authors: Jinsheng Yuan, Yun Tang, Weisi Guo

    Abstract: Mixed-precision computing, a widely applied technique in AI, offers a larger trade-off space between accuracy and efficiency. The recently proposed Mixed-Precision Over-the-Air Federated Learning (MP-OTA-FL) enables clients to operate at appropriate precision levels based on their heterogeneous hardware, taking advantage of the larger trade-off space while covering the quantization overheads in the…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 5 pages, 4 figures, 2 tables, submitted to IEEE VTC 2025 fall for possible publication

  26. arXiv:2503.13917  [pdf, other]

    cs.LG

    Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels

    Authors: Yujia Tong, Yuze Wang, Jingling Yuan, Chuang Hu

    Abstract: Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch d…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 15 pages, 4 figures

  27. arXiv:2503.12910  [pdf, other]

    cs.CV

    MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection

    Authors: Jingyi Yuan, Pengyu Jie, Junyin Zhang, Ziao Li, Chenqiang Gao

    Abstract: Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for identifying defects in unseen categories without requiring target samples in the training phase. However, existing ZSAD methods struggle with the boundary of small and complex defects due to insufficient representations. Most of them use single manually designed prompts, failing to work for diverse objects and anoma…

    Submitted 17 March, 2025; originally announced March 2025.

  28. arXiv:2503.11251  [pdf, other]

    cs.CV cs.CL

    Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

    Authors: Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, et al. (29 additional authors not shown)

    Abstract: We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results de…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 7 pages

  29. arXiv:2503.10412   

    cs.LG cs.AI

    dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis

    Authors: Luyuan Xie, Tianyu Luan, Wenyuan Cai, Guochen Yan, Zhaoyu Chen, Nan Xi, Yuejian Fang, Qingni Shen, Zhonghai Wu, Junsong Yuan

    Abstract: Federated learning has wide applications in the medical field. It enables knowledge sharing among different healthcare institutes while protecting patients' privacy. However, existing federated learning systems are typically centralized, requiring clients to upload client-specific knowledge to a central server for aggregation. This centralized approach would integrate the knowledge from each clien…

    Submitted 19 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: One of the authors, Wenyuan Cai, currently requests not to make the paper public. Before we officially release the paper, we request to withdraw the submission

    Journal ref: Accepted by CVPR 2025

  30. arXiv:2503.09394  [pdf, other]

    cs.CV

    Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models

    Authors: Xiaozhen Qiao, Peng Huang, Jiakang Yuan, Xianda Guo, Bowen Ye, Zhe Sun, Xuelong Li

    Abstract: Test-time adaptation (TTA) is crucial in maintaining Vision-Language Models (VLMs) performance when facing real-world distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods rely on CLIP's output probability distribution for feature evaluation, which can introduce biases under domain shifts. This misalignment may cause features to be misclassi…

    Submitted 12 March, 2025; originally announced March 2025.

  31. Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable Primitives

    Authors: Jiaxin Li, Hongxing Wang, Jiawei Tan, Zhilong Ou, Junsong Yuan

    Abstract: Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under stron…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 15 pages, 15 figures, 8 tables

  32. arXiv:2503.06056  [pdf, other]

    cs.CV

    Pathological Prior-Guided Multiple Instance Learning For Mitigating Catastrophic Forgetting in Breast Cancer Whole Slide Image Classification

    Authors: Weixi Zheng, Aoling Huang, Jingping Yuan, Haoyu Zhao, Zhou Zhao, Yongchao Xu, Thierry Géraud

    Abstract: In histopathology, intelligent diagnosis of Whole Slide Images (WSIs) is essential for automating and objectifying diagnoses, reducing the workload of pathologists. However, diagnostic models often face the challenge of forgetting previously learned data during incremental training on datasets from different sources. To address this issue, we propose a new framework PaGMIL to mitigate catastrophic…

    Submitted 25 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: ICASSP2025(Oral)

  33. arXiv:2503.04814  [pdf, other]

    cs.CL cs.AI

    Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis

    Authors: Yiming Wang, Yi Yang, Jiahong Yuan

    Abstract: Phonetic normalization plays a crucial role in speech recognition and analysis, ensuring the comparability of features derived from raw audio data. However, in the current paradigm of fine-tuning pre-trained large transformer models, phonetic normalization is not deemed a necessary step; instead, it is implicitly executed within the models. This study investigates the normalization process within…

    Submitted 4 March, 2025; originally announced March 2025.

  34. arXiv:2503.04773  [pdf, other]

    cs.CL cs.CY cs.SI

    Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content

    Authors: Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li

    Abstract: Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambigui…

    Submitted 10 March, 2025; v1 submitted 17 February, 2025; originally announced March 2025.

    Comments: 11 pages, 6 figures

  35. arXiv:2503.04629  [pdf, other]

    cs.CL

    SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing

    Authors: Xiangchao Yan, Shiyang Feng, Jiakang Yuan, Renqiu Xia, Bin Wang, Bo Zhang, Lei Bai

    Abstract: Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gap…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: Code and dataset are available for downloading at: https://github.com/Alpha-Innovator/SurveyForge ; 22 pages, 10 figures

  36. arXiv:2503.02221  [pdf, other]

    cs.AI

    Attention Bootstrapping for Multi-Modal Test-Time Adaptation

    Authors: Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, Ming Zhang

    Abstract: Test-time adaptation aims to adapt a well-trained model to potential distribution shifts at test time using only unlabeled test data, without access to the original training data. While previous efforts mainly focus on a single modality, test-time distribution shift in the multi-modal setting is more complex and calls for new solutions. This paper tackles the problem of multi-modal test-time adapt…

    Submitted 3 March, 2025; originally announced March 2025.

  37. arXiv:2503.01330  [pdf, other]

    cs.CL

    WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models

    Authors: Jian Yuan, Ziwei He, Haoli Bai, Jingwen Leng, Bo Jiang

    Abstract: Large Language Models (LLMs) use key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache size increases linearly during generation, leading to excessive memory usage, especially for long texts. Most KV cache compression methods evict the unimportant KV pairs to maintain a fixed cache size, which leads to the permanent loss of tokens during generatio…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Accepted by ICASSP 2025

  38. arXiv:2503.01202  [pdf, other]

    cs.CV cs.RO eess.IV

    A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping

    Authors: Jialei He, Zhihao Zhan, Zhituo Tu, Xiang Zhu, Jie Yuan

    Abstract: Rapid generation of large-scale orthoimages from Unmanned Aerial Vehicles (UAVs) has been a long-standing focus of research in the field of aerial mapping. A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. In this paper, we utilize multi-sensor data to…

    Submitted 4 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  39. FSMP: A Frontier-Sampling-Mixed Planner for Fast Autonomous Exploration of Complex and Large 3-D Environments

    Authors: Shiyong Zhang, Xuebo Zhang, Qianli Dong, Ziyu Wang, Haobo Xi, Jing Yuan

    Abstract: In this paper, we propose a systematic framework for fast exploration of complex and large 3-D environments using micro aerial vehicles (MAVs). The key insight is the organic integration of the frontier-based and sampling-based strategies that can achieve rapid global exploration of the environment. Specifically, a field-of-view-based (FOV) frontier detector with the guarantee of completeness and…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: 13 pages, 12 figures, accepted by IEEE Transactions on Instrumentation and Measurement

  40. arXiv:2502.20307  [pdf, other]

    cs.CV

    Mobius: Text to Seamless Looping Video Generation via Latent Shift

    Authors: Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao

    Abstract: We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual materials for the multi-media presentation. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by co…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: Project page: https://mobius-diffusion.github.io/ ; GitHub repository: https://github.com/YisuiTT/Mobius

  41. arXiv:2502.15961  [pdf, other]

    cs.RO

    IA-TIGRIS: An Incremental and Adaptive Sampling-Based Planner for Online Informative Path Planning

    Authors: Brady Moon, Nayana Suvarna, Andrew Jong, Satrajit Chatterjee, Junbin Yuan, Sebastian Scherer

    Abstract: Planning paths that maximize information gain for robotic platforms has wide-ranging applications and significant potential impact. To effectively adapt to real-time data collection, informative path planning must be computed online and be responsive to new observations. In this work, we present IA-TIGRIS, an incremental and adaptive sampling-based informative path planner that can be run efficien…

    Submitted 21 February, 2025; originally announced February 2025.

    Comments: 16 pages, 19 figures

  42. arXiv:2502.15576  [pdf, other]

    cs.CL

    Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

    Authors: Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

    Abstract: Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explor…

    Submitted 21 February, 2025; originally announced February 2025.

    Comments: Pre-print. 20 pages, 5 figures

  43. arXiv:2502.12531  [pdf, other]

    cs.RO cs.AI

    GSCE: A Prompt Framework with Enhanced Reasoning for Reliable LLM-driven Drone Control

    Authors: Wenhao Wang, Yanyan Li, Long Jiao, Jiawei Yuan

    Abstract: The integration of Large Language Models (LLMs) into robotic control, including drones, has the potential to revolutionize autonomous systems. Research studies have demonstrated that LLMs can be leveraged to support robotic operations. However, when facing tasks with complex reasoning, concerns and challenges are raised about the reliability of solutions produced by LLMs. In this paper, we propose…

    Submitted 7 April, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: 8 pages

  44. arXiv:2502.11718  [pdf, other]

    cs.CL cs.CV

    ChineseSimpleVQA -- "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models

    Authors: Jihao Gu, Yingyao Wang, Pi Bu, Chen Wang, Ziming Wang, Tengtao Song, Donglai Wei, Jiale Yuan, Yingxiu Zhao, Yancheng He, Shilong Li, Jiaheng Liu, Meng Cao, Jun Song, Yingshui Tan, Xiang Li, Wenbo Su, Zhicheng Zheng, Xiaoyong Zhu, Bo Zheng

    Abstract: The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major t…

    Submitted 26 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: 24 pages, 21 figures

  45. arXiv:2502.11573  [pdf, other]

    cs.CL cs.AI

    InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

    Authors: Congkai Xie, Shuo Cai, Wenjun Wang, Pengxiang Li, Zhijie Sang, Kejing Yang, Yiming Zhang, Zhen Li, Guanghao Zhu, Zeyu Liu, Yang Yu, Yuhang Liu, Su Lu, Baoyi He, Qi Zhou, Xiaotian Han, Jianbo Yuan, Shengyu Zhang, Fei Wu, Hongxia Yang

    Abstract: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introd…

    Submitted 17 February, 2025; originally announced February 2025.

  46. arXiv:2502.11089  [pdf, other]

    cs.CL cs.AI cs.LG

    Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

    Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with har…

    Submitted 27 February, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

  47. arXiv:2502.10677  [pdf, other]

    cs.CV

    FocalCount: Towards Class-Count Imbalance in Class-Agnostic Counting

    Authors: Huilin Zhu, Jingling Yuan, Zhengwei Yang, Yu Guo, Xian Zhong, Shengfeng He

    Abstract: In class-agnostic object counting, the goal is to estimate the total number of object instances in an image without distinguishing between specific categories. Existing methods often predict this count without considering class-specific outputs, leading to inaccuracies when such outputs are required. These inaccuracies stem from two key challenges: 1) the prevalence of single-category images in da…

    Submitted 15 February, 2025; originally announced February 2025.

  48. arXiv:2502.10042  [pdf, other]

    cs.IT

    Scaling Law Tradeoff Between Throughput and Sensing Distance in Large ISAC Networks

    Authors: Min Qiu, Ming-Chun Lee, Yu-Chih Huang, Jinhong Yuan

    Abstract: In this paper, we investigate the fundamental tradeoff between communication and sensing performance of ad hoc integrated sensing and communication (ISAC) wireless networks. Specifically, we consider that $n$ nodes are randomly located in an extended network with area $n$ and transmit ISAC signals. Under the pure path loss channel gain model and the condition that the transmission power sca…

    Submitted 14 February, 2025; originally announced February 2025.

  49. arXiv:2502.09670  [pdf, other]

    cs.CL cs.AI

    The Science of Evaluating Foundation Models

    Authors: Jiayi Yuan, Jiamu Zhang, Andrew Wen, Xia Hu

    Abstract: The emergent phenomena of large foundation models have revolutionized natural language processing. However, evaluating these models presents significant challenges due to their size, capabilities, and deployment across diverse applications. Existing literature often focuses on individual aspects, such as benchmark performance or specific tasks, but fails to provide a cohesive process that integrat…

    Submitted 12 February, 2025; originally announced February 2025.

  50. A Contextual-Aware Position Encoding for Sequential Recommendation

    Authors: Jun Yuan, Guohao Cai, Zhenhua Dong

    Abstract: Sequential recommendation (SR), which encodes user activity to predict the next action, has emerged as a widely adopted strategy in developing commercial personalized recommendation systems. A critical component of modern SR models is the attention mechanism, which synthesizes users' historical activities. This mechanism is typically order-invariant and generally relies on position encoding (PE).…

    Submitted 21 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: Accepted by WWW'25 Industry Track