Showing 1–50 of 2,519 results for author: Yifan

Searching in archive cs.
  1. arXiv:2506.21360  [pdf, ps, other]

    cs.CL

    Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models

    Authors: Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen

    Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted in CogSci 2025

  2. arXiv:2506.20990  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes

    Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

    Abstract: Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making them unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-var…

    Submitted 26 June, 2025; originally announced June 2025.

  3. arXiv:2506.20741  [pdf, ps, other]

    cs.CV

    OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

    Authors: Qin Ren, Yifan Wang, Ruogu Fang, Haibin Ling, Chenyu You

    Abstract: Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of mode…

    Submitted 25 June, 2025; originally announced June 2025.

  4. arXiv:2506.20702  [pdf]

    cs.AI cs.CY

    The Singapore Consensus on Global AI Safety Research Priorities

    Authors: Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai, et al. (61 additional authors not shown)

    Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: https://www.scai.gov.sg/2025/scai2025-report

  5. arXiv:2506.20168  [pdf, ps, other]

    cs.CV

    Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

    Authors: Zhentao He, Can Zhang, Ziheng Wu, Zhenghao Chen, Yufei Zhan, Yifan Li, Zhao Zhang, Xian Wang, Minghui Qiu

    Abstract: Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading…

    Submitted 25 June, 2025; originally announced June 2025.

  6. arXiv:2506.20123  [pdf, ps, other]

    cs.CE

    DiT-SGCR: Directed Temporal Structural Representation with Global-Cluster Awareness for Ethereum Malicious Account Detection

    Authors: Ye Tian, Liangliang Song, Peng Qian, Yanbin Wang, Jianguo Sun, Yifan Jia

    Abstract: The detection of malicious accounts on Ethereum - the preeminent DeFi platform - is critical for protecting digital assets and maintaining trust in decentralized finance. Recent advances highlight that temporal transaction evolution reveals more attack signatures than static graphs. However, current methods either fail to model continuous transaction dynamics or incur high computational costs that…

    Submitted 25 June, 2025; originally announced June 2025.

  7. arXiv:2506.19468  [pdf, ps, other]

    cs.CL cs.AI

    MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

    Authors: Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, Yin Zheng

    Abstract: Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 language…

    Submitted 24 June, 2025; originally announced June 2025.

  8. arXiv:2506.19356  [pdf, ps, other]

    cs.CR cs.LG

    WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT

    Authors: Ye Tian, Zhang Yumin, Yifan Jia, Jianguo Sun, Yanbin Wang

    Abstract: URL+HTML feature fusion shows promise for robust malicious URL detection, since attacker artifacts persist in DOM structures. However, prior work suffers from four critical shortcomings: (1) incomplete URL modeling, failing to jointly capture lexical patterns and semantic context; (2) HTML graph sparsity, where threat-indicative nodes (e.g., obfuscated scripts) are isolated amid benign content, ca…

    Submitted 24 June, 2025; originally announced June 2025.

  9. arXiv:2506.18923  [pdf, ps, other]

    cs.PL cs.CL cs.SE

    Mix-of-Language-Experts Architecture for Multilingual Programming

    Authors: Yifan Zong, Yuntian Deng, Pengyu Nie

    Abstract: Large language models (LLMs) have demonstrated impressive capabilities in aiding developers with tasks like code comprehension, generation, and translation. Supporting multilingual programming -- i.e., coding tasks across multiple programming languages -- typically requires either (1) finetuning a single LLM across all programming languages, which is cost-efficient but sacrifices language-specific…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Accepted at LLM4Code @ ICSE 2025

  10. arXiv:2506.18767  [pdf, ps, other]

    cs.CR

    Physical Layer Challenge-Response Authentication between Ambient Backscatter Devices

    Authors: Yifan Zhang, Yongchao Dang, Masoud Kaveh, Zheng Yan, Riku Jäntti, Zhu Han

    Abstract: Ambient backscatter communication (AmBC) has become an integral part of ubiquitous Internet of Things (IoT) applications due to its energy-harvesting capabilities and ultra-low-power consumption. However, the open wireless environment exposes AmBC systems to various attacks, and existing authentication methods cannot be implemented between resource-constrained backscatter devices (BDs) due to thei…

    Submitted 23 June, 2025; originally announced June 2025.

  11. arXiv:2506.18701  [pdf, ps, other]

    cs.CV cs.AI

    Matrix-Game: Interactive World Foundation Model

    Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou

    Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising ove…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Technical Report

  12. arXiv:2506.18637  [pdf, ps, other]

    cs.LG cs.AI

    Granular-Ball-Induced Multiple Kernel K-Means

    Authors: Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang

    Abstract: Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can lead to difficulty in accurately capturing data sets' inherent structure and diversity. Additionally, the intric…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Accepted by IJCAI 2025

  13. arXiv:2506.17637  [pdf, ps, other]

    cs.CL cs.LG

    Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation

    Authors: Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng

    Abstract: Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Inst…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 17 pages, 12 figures

  14. arXiv:2506.17630  [pdf, ps, other]

    cs.CL

    Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs

    Authors: Yang Wu, Yifan Zhang, Yiwei Wang, Yujun Cai, Yurong Wu, Yuran Wang, Ning Xu, Jian Cheng

    Abstract: While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt fr…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 14 pages, 8 figures

  15. arXiv:2506.17611  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    OpusLM: A Family of Open Unified Speech Language Models

    Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe

    Abstract: This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in…

    Submitted 21 June, 2025; originally announced June 2025.

  16. arXiv:2506.17582  [pdf, ps, other]

    cs.LG physics.comp-ph

    LFR-PINO: A Layered Fourier Reduced Physics-Informed Neural Operator for Parametric PDEs

    Authors: Jing Wang, Biao Chen, Hairun Xie, Rui Wang, Yifan Xia, Jifa Zhang, Hui Xu

    Abstract: Physics-informed neural operators have emerged as a powerful paradigm for solving parametric partial differential equations (PDEs), particularly in the aerospace field, enabling the learning of solution operators that generalize across parameter spaces. However, existing methods either suffer from limited expressiveness due to fixed basis/coefficient designs, or face computational challenges due t…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 28 pages, 17 figures

  17. arXiv:2506.17576  [pdf, ps, other]

    cs.LG

    Towards Deeper GCNs: Alleviating Over-smoothing via Iterative Training and Fine-tuning

    Authors: Furong Peng, Jinzhen Gao, Xuan Lu, Kang Liu, Yifan Huo, Sheng Wang

    Abstract: Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at mo…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 16 pages, 18 figures

  18. arXiv:2506.17457  [pdf, ps, other]

    cs.CV

    When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network

    Authors: Dong Xiao, Guangyao Chen, Peixi Peng, Yangru Huang, Yifan Zhao, Yongxing Dai, Yonghong Tian

    Abstract: Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynch…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: ICML 2025 Spotlight

  19. arXiv:2506.17367  [pdf, ps, other]

    cs.CL cs.AI cs.MA

    Cash or Comfort? How LLMs Value Your Inconvenience

    Authors: Mateusz Cedro, Timour Ichmoukhamedov, Sofie Goethals, Yifan He, James Hinns, David Martens

    Abstract: Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behavio…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 12 pages, 4 figures, 3 tables

  20. arXiv:2506.16731  [pdf, ps, other]

    cs.AI cs.DC cs.LG

    Incentivizing High-quality Participation From Federated Learning Agents

    Authors: Jinlong Pang, Jiaheng Wei, Yifan Hua, Chen Qian, Yang Liu

    Abstract: Federated learning (FL) provides a promising paradigm for facilitating collaboration between multiple clients that jointly learn a global model without directly sharing their local data. However, existing research suffers from two caveats: 1) From the perspective of agents, voluntary and unselfish participation is often assumed. But self-interested agents may opt out of the system or provide low-q…

    Submitted 19 June, 2025; originally announced June 2025.

  21. arXiv:2506.16685  [pdf, ps, other]

    cs.RO cs.LG

    Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections

    Authors: Xiaomeng Xu, Yifan Hou, Zeyi Liu, Shuran Song

    Abstract: We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gen…

    Submitted 19 June, 2025; originally announced June 2025.

  22. arXiv:2506.15889  [pdf, ps, other]

    cs.CL

    Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

    Authors: Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

    Abstract: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy…

    Submitted 18 June, 2025; originally announced June 2025.

  23. arXiv:2506.15666  [pdf, ps, other]

    cs.RO

    Vision in Action: Learning Active Perception from Human Demonstrations

    Authors: Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song

    Abstract: We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we…

    Submitted 18 June, 2025; originally announced June 2025.

  24. arXiv:2506.15539  [pdf, ps, other]

    cs.RO

    Aerial Grasping via Maximizing Delta-Arm Workspace Utilization

    Authors: Haoran Chen, Weiliang Deng, Biyu Ye, Yifan Xiong, Ximin Lyu

    Abstract: The workspace limits the operational capabilities and range of motion of systems with robotic arms. Maximizing workspace utilization has the potential to provide more optimal solutions for aerial manipulation tasks, increasing the system's flexibility and operational efficiency. In this paper, we introduce a novel planning framework for aerial grasping that maximizes workspace utilization. We…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 8 pages, 7 figures

  25. arXiv:2506.14827  [pdf, ps, other]

    cs.CV cs.AI

    DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

    Authors: Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao, Lin Luo, Zixing Chen, Li Wang, Xin Wang, Yixu Wang, Xingjun Ma, Yu-Gang Jiang

    Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge ex…

    Submitted 13 June, 2025; originally announced June 2025.

  26. MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution

    Authors: Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma

    Abstract: Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages fro…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

  27. arXiv:2506.14370  [pdf, ps, other]

    cs.CL

    Digital Gatekeepers: Google's Role in Curating Hashtags and Subreddits

    Authors: Amrit Poudel, Yifan Ding, Jurgen Pfeffer, Tim Weninger

    Abstract: Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reve…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025 Main

    Journal ref: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics 2025

  28. arXiv:2506.14204  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios

    Authors: Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong

    Abstract: We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  29. arXiv:2506.14142  [pdf, ps, other]

    cs.CV cs.CL

    RadFabric: Agentic AI System with Reasoning Capability for Radiology

    Authors: Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou, Fang Zeng, Yijun Luo, Tianyu Lin, Yihang Su, Yichen Wu, Kai Zhang, Zhen Xiang, Tianming Liu, Ninghao Liu, Lichao Sun, Yixuan Yuan, Xiang Li

    Abstract: Chest X-ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadF…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 4 figures, 2 tables

  30. arXiv:2506.14035  [pdf, ps, other]

    cs.CV cs.AI

    SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement

    Authors: Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang

    Abstract: Document Visual Question Answering (DocVQA) is a practical yet challenging task, which is to ask questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize Visual Language Models (VLMs) based embedding model to emb…

    Submitted 16 June, 2025; originally announced June 2025.

  31. arXiv:2506.13226  [pdf, ps, other]

    cs.CE

    A modified Newmark/Newton-Raphson method with automatic differentiation for general nonlinear dynamics analysis

    Authors: Yifan Jiang, Yuhong Jin, Lei Hou, Yi Chen, Andong Cong

    Abstract: The Newmark/Newton-Raphson (NNR) method is widely employed for solving nonlinear dynamic systems. However, the current NNR method exhibits limited applicability in complex nonlinear dynamic systems, as the acquisition of the Jacobian matrix required for Newton iterations incurs substantial computational costs and may even prove intractable in certain cases. To address these limitations, we integra…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 18 pages, 9 figures

  32. arXiv:2506.12570  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

    Authors: Hui Wang, Yifan Yang, Shujie Liu, Jinyu Li, Lingwei Meng, Yanqing Liu, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

    Abstract: Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In thi…

    Submitted 14 June, 2025; originally announced June 2025.

  33. arXiv:2506.12409  [pdf, ps, other]

    cs.CV

    Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

    Authors: Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Jun Luo, Yifan Zhu, Tao Feng

    Abstract: Continual learning in vision-language models (VLMs) faces critical challenges in balancing parameter efficiency, memory consumption, and optimization stability. While First-Order (FO) optimization methods (e.g., SGD) dominate current approaches, their deterministic gradients often trap models in suboptimal local minima and incur substantial memory overhead. This paper pioneers a systematic exploration of…

    Submitted 14 June, 2025; originally announced June 2025.

  34. arXiv:2506.12012  [pdf, ps, other]

    cs.AI

    Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making

    Authors: Xiaopeng Yuan, Xingjian Zhang, Ke Xu, Yifan Xu, Lijun Yu, Jindong Wang, Yushun Dong, Haohan Wang

    Abstract: Large language models (LLMs) are increasingly used for tasks that require complex reasoning. Most benchmarks focus on final outcomes but overlook the intermediate reasoning steps - such as planning, revision, and decision making under resource constraints. We argue that measuring these internal processes is essential for understanding model behavior and improving reliability. We propose using stra…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 19 pages, 7 figures. Under review

  35. arXiv:2506.11882  [pdf, ps, other]

    cs.LG cs.AI

    An Explainable AI Framework for Dynamic Resource Management in Vehicular Network Slicing

    Authors: Haochen Sun, Yifan Liu, Ahmed Al-Tahmeesschi, Swarna Chetty, Syed Ali Raza Zaidi, Avishek Nag, Hamed Ahmadi

    Abstract: Effective resource management and network slicing are essential to meet the diverse service demands of vehicular networks, including Enhanced Mobile Broadband (eMBB) and Ultra-Reliable and Low-Latency Communications (URLLC). This paper introduces an Explainable Deep Reinforcement Learning (XRL) framework for dynamic network slicing and resource allocation in vehicular networks, built upon a near-r…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: To appear in Proceedings of IEEE PIMRC 2025. 6 pages, 4 figures

  36. arXiv:2506.11329  [pdf, ps, other]

    cs.AR

    A4: Microarchitecture-Aware LLC Management for Datacenter Servers with Emerging I/O Devices

    Authors: Haneul Park, Jiaqi Lou, Sangjin Lee, Yifan Yuan, Kyoung Soo Park, Yongseok Son, Ipoom Jeong, Nam Sung Kim

    Abstract: In modern server CPUs, the Last-Level Cache (LLC) serves not only as a victim cache for higher-level private caches but also as a buffer for low-latency DMA transfers between CPU cores and I/O devices through Direct Cache Access (DCA). However, prior work has shown that high-bandwidth network-I/O devices can rapidly flood the LLC with packets, often causing significant contention with co-running w…

    Submitted 12 June, 2025; originally announced June 2025.

  37. arXiv:2506.11037  [pdf, ps, other]

    cs.LG

    Mini-Game Lifetime Value Prediction in WeChat

    Authors: Aochuan Chen, Yifan Niu, Ziqi Gao, Yujie Sun, Shoujun Liu, Gong Chen, Yang Liu, Jia Li

    Abstract: The LifeTime Value (LTV) prediction, which endeavors to forecast the cumulative purchase contribution of a user to a particular item, remains a vital challenge that advertisers are keen to resolve. A precise LTV prediction system enhances the alignment of user interests with meticulously designed advertisements, thereby generating substantial profits for advertisers. Nonetheless, this issue is com…

    Submitted 17 June, 2025; v1 submitted 20 May, 2025; originally announced June 2025.

    Comments: KDD ADS Track 2025

  38. arXiv:2506.10998  [pdf, other]

    cs.SE cs.AI

    Towards Automated Formal Verification of Backend Systems with LLMs

    Authors: Kangping Xu, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: Software testing plays a critical role in ensuring that systems behave as intended. However, existing automated testing approaches struggle to match the capabilities of human engineers due to key limitations such as test locality, lack of general reliability, and business logic blindness. In this work, we propose a novel framework that leverages functional programming and type systems to translate…

    Submitted 13 April, 2025; originally announced June 2025.

  39. arXiv:2506.10822  [pdf, ps, other]

    cs.CL

    ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

    Authors: Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu

    Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is o…

    Submitted 12 June, 2025; originally announced June 2025.

  40. arXiv:2506.10712  [pdf, ps, other]

    cs.CV

    Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

    Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He

    Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first g…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 16 pages, 7 figures

  41. arXiv:2506.09507  [pdf, ps, other]

    cs.CL cs.AI

    TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

    Authors: Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo

    Abstract: Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit R…

    Submitted 18 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  42. arXiv:2506.09473  [pdf, ps, other]

    cs.CV

    Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

    Authors: Cheng Chen, Yunpeng Zhai, Yifan Zhao, Jinyang Gao, Bolin Ding, Jia Li

    Abstract: In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research effort…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 10 pages, 6 figures, CVPR 2025

  43. arXiv:2506.09427  [pdf, other]

    cs.CV cs.AI

    A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

    Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang

    Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed us…

    Submitted 11 June, 2025; originally announced June 2025.

  44. arXiv:2506.09404  [pdf, ps, other]

    cs.LG cs.NE

    Synergizing Reinforcement Learning and Genetic Algorithms for Neural Combinatorial Optimization

    Authors: Shengda Gu, Kai Li, Junliang Xing, Yifan Zhang, Jian Cheng

    Abstract: Combinatorial optimization problems are notoriously challenging due to their discrete structure and exponentially large solution space. Recent advances in deep reinforcement learning (DRL) have enabled learning heuristics directly from data. However, DRL methods often suffer from limited exploration and susceptibility to local optima. On the other hand, evolutionary algorithms such as Genetic…

    Submitted 11 June, 2025; originally announced June 2025.

  45. arXiv:2506.09042  [pdf, ps, other]

    cs.CV

    Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

    Authors: Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, Huan Ling

    Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generat…

    Submitted 18 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Only the core contributors are listed. The full list of contributors can be found in Appendix A of this paper

  46. arXiv:2506.08283  [pdf, ps, other]

    cs.IR

    Serendipitous Recommendation with Multimodal LLM

    Authors: Haoting Wang, Jianling Wang, Hao Li, Fangjun Yi, Mengyu Fu, Youwei Zhang, Yifan Liu, Liang Liu, Minmin Chen, Ed H. Chi, Lichan Hong, Haokai Lu

    Abstract: Conventional recommendation systems succeed in identifying relevant content but often fail to provide users with surprising or novel items. Multimodal Large Language Models (MLLMs) possess the world knowledge and multimodal understanding needed for serendipity, but their integration into billion-item-scale platforms presents significant challenges. In this paper, we propose a novel hierarchical fr…

    Submitted 9 June, 2025; originally announced June 2025.

  47. arXiv:2506.07984  [pdf, ps, other]

    cs.CV cs.LG

    CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

    Authors: Mingquan Lin, Gregory Holste, Song Wang, Yiliang Zhou, Yishu Wei, Imon Banerjee, Pengyi Chen, Tianjie Dai, Yuexi Du, Nicha C. Dvornek, Yuyan Ge, Zuowei Guo, Shouhei Hanaoka, Dongkyun Kim, Pablo Messina, Yang Lu, Denis Parra, Donghyun Son, Álvaro Soto, Aisha Urooj, René Vidal, Yosuke Yamagishi, Zefan Yang, Ruichi Zhang, Yang Zhou, et al. (8 additional authors not shown)

    Abstract: The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting c…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 17 pages, 3 figures

  48. arXiv:2506.07553  [pdf, ps, other]

    cs.AI q-bio.QM

    GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

    Authors: Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He

    Abstract: Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a…

    Submitted 9 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  49. arXiv:2506.07406  [pdf, ps, other]

    cs.LG cs.AI

    InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

    Authors: Yifan Luo, Zhennan Zhou, Bin Dong

    Abstract: Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via inp…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  50. arXiv:2506.07335  [pdf, ps, other]

    cs.CL cs.AI

    Improving LLM Reasoning through Interpretable Role-Playing Steering

    Authors: Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du

    Abstract: Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 21 pages, 8 figures, 8 tables