Showing 1–50 of 201 results for author: Nie, Y

Searching in archive cs.
  1. arXiv:2507.02994  [pdf, ps, other]

    cs.LG cs.CV

    MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization

    Authors: Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li, Junzhi Ning, Lihao Liu, Hongqiu Wang, Lei Zhu, Jiyao Liu, Xiaomeng Li, Junjun He

    Abstract: Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensiv…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: MICCAI2025 Early Accept

  2. arXiv:2506.08626  [pdf, ps, other]

    cs.IR

    Leveraging LLMs to Evaluate Usefulness of Document

    Authors: Xingzhu Wang, Erhan Zhang, Yiqun Chen, Jinghan Xuan, Yucheng Hou, Yitong Xu, Ying Nie, Shuaiqiang Wang, Dawei Yin, Jiaxin Mao

    Abstract: The conventional Cranfield paradigm struggles to effectively capture user satisfaction due to its weak correlation between relevance and satisfaction, alongside the high costs of relevance annotation in building test collections. To tackle these issues, our research explores the potential of leveraging large language models (LLMs) to generate multilevel usefulness labels for evaluation. We introdu…

    Submitted 10 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

  3. arXiv:2506.05713  [pdf, ps, other]

    cs.LG

    Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation

    Authors: Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei

    Abstract: Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapter…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Accepted by ICML 2025. Code link: https://github.com/zwebzone/coto

  4. arXiv:2506.05182  [pdf, ps, other]

    cs.IR

    On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools

    Authors: Shivani Upadhyay, Messiah Ataey, Shariyar Murtuza, Yifan Nie, Jimmy Lin

    Abstract: The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-di…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 15 pages, 5 figures, 9 tables

  5. arXiv:2506.02846  [pdf, ps, other]

    cs.CV

    PBR-SR: Mesh PBR Texture Super Resolution from 2D Image Priors

    Authors: Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Nießner

    Abstract: We present PBR-SR, a novel method for physically based rendering (PBR) texture super resolution (SR). It outputs high-resolution, high-quality PBR textures from low-resolution (LR) PBR input in a zero-shot manner. PBR-SR leverages an off-the-shelf super-resolution model trained on natural images, and iteratively minimizes the deviations between super-resolution priors and differentiable renderings…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Project page: https://terencecyj.github.io/projects/PBR-SR/, Video: https://youtu.be/eaM5S3Mt1RM

  6. arXiv:2506.02692  [pdf, ps, other]

    cs.CV

    Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

    Authors: Shu Yang, Fengtao Zhou, Leon Mayer, Fuxiang Huang, Yiliang Chen, Yihui Wang, Sunan He, Yuxiang Nie, Xi Wang, Ömer Sümer, Yueming Jin, Huihui Sun, Shuchang Xu, Alex Qinyang Liu, Zheng Li, Jing Qin, Jeremy YuenChun Teoh, Lena Maier-Hein, Hao Chen

    Abstract: Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit tempora…

    Submitted 3 June, 2025; originally announced June 2025.

  7. arXiv:2506.02535  [pdf, ps, other]

    cs.CV

    MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

    Authors: Juntong Li, Lingwei Dang, Yukun Su, Yun Hao, Qingxin Xiao, Yongwei Nie, Qingyao Wu

    Abstract: Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantic in ab…

    Submitted 4 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  8. arXiv:2506.01551  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

    Authors: Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

    Abstract: Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corp…

    Submitted 2 June, 2025; originally announced June 2025.

  9. arXiv:2506.00855  [pdf, other]

    cs.AI

    MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

    Authors: Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, Hao Chen

    Abstract: The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and prov…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: For data and code, see: https://huggingface.co/datasets/slyipae1/MedBookVQA and https://github.com/slyipae1/MedBookVQA

  10. arXiv:2505.23885  [pdf, ps, other]

    cs.AI cs.CL

    OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

    Authors: Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li

    Abstract: Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework t…

    Submitted 10 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: Project Page: https://github.com/camel-ai/owl

  11. CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

    Authors: Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, Min Yang

    Abstract: Project building is pivotal to support various program analysis tasks, such as generating intermediate representation code for static analysis and preparing binary code for vulnerability reproduction. However, automating the building process for C/C++ projects is a highly complex endeavor, involving tremendous technical challenges, such as intricate dependency management, diverse build systems,…

    Submitted 27 May, 2025; originally announced May 2025.

  12. arXiv:2505.20148  [pdf, ps, other]

    cs.AI

    MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

    Authors: Ziming Wei, Bingqian Lin, Zijian Jiao, Yunshuang Nie, Liang Ma, Yuecheng Liu, Yuzheng Zhuang, Xiaodan Liang

    Abstract: Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluati…

    Submitted 27 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  13. arXiv:2505.19144  [pdf, other]

    cs.LG q-bio.QM

    ADGSyn: Dual-Stream Learning for Efficient Anticancer Drug Synergy Prediction

    Authors: Yuxuan Nie, Yutong Song, Hong Peng

    Abstract: Drug combinations play a critical role in cancer therapy by significantly enhancing treatment efficacy and overcoming drug resistance. However, the combinatorial space of possible drug pairs grows exponentially, making experimental screening highly impractical. Therefore, developing efficient computational methods to predict promising drug combinations and guide experimental validation is of param…

    Submitted 25 May, 2025; originally announced May 2025.

  14. Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

    Authors: Yuhan Ji, Song Gao, Ying Nie, Ivan Majić, Krzysztof Janowicz

    Abstract: Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known-text (WKT) representation of geometries and their spatial relat…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 33 pages, 13 figures, IJGIS GeoFM Special Issue

    ACM Class: I.2

    Journal ref: International Journal of Geographical Information Science, 2025

  15. arXiv:2505.16416  [pdf, other]

    cs.CV cs.AI

    Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

    Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

    Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignmen…

    Submitted 22 May, 2025; originally announced May 2025.

  16. Interest Changes: Considering User Interest Life Cycle in Recommendation System

    Authors: Yinjiang Cai, Jiangpan Hou, Yangping Zhu, Yuan Nie

    Abstract: In recommendation systems, user interests are always in a state of constant flux. Typically, a user interest experiences an emergent phase, a stable phase, and a declining phase, which are referred to as the "user interest life-cycle". Recent papers on user interest modeling have primarily focused on how to compute the correlation between the target item and user's historical behaviors, without tho…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted by SIGIR 2025

  17. arXiv:2505.05849  [pdf, ps, other]

    cs.CR cs.AI

    AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

    Authors: Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song

    Abstract: The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent-based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, t…

    Submitted 13 June, 2025; v1 submitted 9 May, 2025; originally announced May 2025.

  18. arXiv:2504.21336  [pdf, ps, other]

    cs.CV

    UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

    Authors: Linshan Wu, Yuxiang Nie, Sunan He, Jiaxin Zhuang, Luyang Luo, Neeraj Mahboobani, Varut Vardhanabhuti, Ronald Cheong Kin Chan, Yifan Peng, Pranav Rajpurkar, Hao Chen

    Abstract: The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate…

    Submitted 29 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: The first universal foundation model for grounded biomedical image interpretation

  19. arXiv:2504.10834  [pdf, other]

    cs.CV

    LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

    Authors: Sihang Chen, Lijun Yun, Ze Liu, JianFeng Zhu, Jie Chen, Hui Wang, Yueping Nie

    Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 26 pages, 69 figures

  20. arXiv:2504.03342  [pdf, other]

    cs.CV cs.AI

    EOOD: Entropy-based Out-of-distribution Detection

    Authors: Guide Yang, Chao Hou, Weilong Peng, Xiang Fang, Yongwei Nie, Peican Zhu, Keke Tang

    Abstract: Deep neural networks (DNNs) often exhibit overconfidence when encountering out-of-distribution (OOD) samples, posing significant challenges for deployment. Since DNNs are trained on in-distribution (ID) datasets, the information flow of ID samples through DNNs inevitably differs from that of OOD samples. In this paper, we propose an Entropy-based Out-Of-distribution Detection (EOOD) framework. EOO…

    Submitted 4 April, 2025; originally announced April 2025.

    Comments: IJCNN 2025

  21. arXiv:2503.22747  [pdf, other]

    cs.LG cs.AI cs.ET

    LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence

    Authors: Zheng Tan, Yiwen Nie, Wenfa Wu, Guanyu Zhang, Yanze Liu, Xinyuan Tian, Kailin Gao, Mengya Liu, Qijiang Cheng, Haipeng Jiang, Yingzheng Ma, Wei Zheng, Yuci Zhu, Yuanyuan Sun, Xiangyu Lei, Xiyu Guan, Wanqing Huang, Shouming Liu, Xiangquan Meng, Pengzhan Qu, Chao Yang, Jiaxuan Fan, Yuan He, Hongsheng Qi, Yangzhou Du

    Abstract: Demand is spiking in industrial fields for multidisciplinary forecasting, where a broad spectrum of sectors needs planning and forecasts to streamline intelligent business management, such as demand forecasting, product planning, inventory optimization, etc. Specifically, these tasks expecting intelligent approaches to learn from sequentially collected historical data and then foresee most possibl…

    Submitted 26 March, 2025; originally announced March 2025.

  22. arXiv:2503.20680  [pdf, other]

    cs.CV cs.CL

    Vision as LoRA

    Authors: Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang

    Abstract: We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating stru…

    Submitted 26 March, 2025; originally announced March 2025.

  23. arXiv:2503.19398  [pdf, other]

    cs.HC

    CyanKitten: AI-Driven Markerless Motion Capture for Improved Elderly Well-Being

    Authors: Mengyao Guo, Yu Nie, Jinda Han, Zongxing Li, Ze Gao

    Abstract: This paper introduces CyanKitten, an interactive virtual companion system tailored for elderly users, integrating advanced posture recognition, behavior recognition, and multimodal interaction capabilities. The system utilizes a three-tier architecture to process and interpret user movements and gestures, leveraging a dual-camera setup and a convolutional neural network trained explicitly on elder…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, April 26-May 1, 2025, Yokohama, Japan

    ACM Class: F.2.2; I.2.7

  24. arXiv:2503.18402  [pdf, other]

    cs.CV

    DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds

    Authors: Youyu Chen, Junjun Jiang, Kui Jiang, Xiao Tang, Zhihao Li, Xianming Liu, Yinyu Nie

    Abstract: 3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, concluded as the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Spec…

    Submitted 26 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025. Project page: https://dashgaussian.github.io

  25. arXiv:2503.18065  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

    Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang

    Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires exte…

    Submitted 23 March, 2025; originally announced March 2025.

  26. arXiv:2503.14198  [pdf, other]

    cs.CV

    RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images

    Authors: Junjin Xiao, Qing Zhang, Yongwei Nie, Lei Zhu, Wei-Shi Zheng

    Abstract: This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen human from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with few overlappings and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in s…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR2025

  27. arXiv:2503.06252  [pdf, other]

    cs.CV cs.AI

    Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

    Authors: Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang

    Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that different levels of reasoning abilities can be combined dynamically to tackle questions with different complexity. To this end, we propose a paradigm of Self-structured Chain of Thought (SCoT), which…

    Submitted 8 March, 2025; originally announced March 2025.

  28. arXiv:2502.14948  [pdf, ps, other]

    cs.SE

    Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation

    Authors: Zi Lin, Sheng Shen, Jingbo Shang, Jason Weston, Yixin Nie

    Abstract: Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. Prior work has shown the potential of synthetic self-instruct data, but naively training on a model's own outputs can cause error accumulation, especially in coding tasks, where generalization may coll…

    Submitted 4 June, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

    Comments: 14 pages, 5 figures

  29. arXiv:2502.12640  [pdf, other]

    cs.CV

    RecDreamer: Consistent Text-to-3D Generation via Uniform Score Distillation

    Authors: Chenxi Zheng, Yihong Lin, Bangzhen Liu, Xuemiao Xu, Yongwei Nie, Shengfeng He

    Abstract: Current text-to-3D generation methods based on score distillation often suffer from geometric inconsistencies, leading to repeated patterns across different poses of 3D assets. This issue, known as the Multi-Face Janus problem, arises because existing methods struggle to maintain consistency across varying poses and are biased toward a canonical pose. While recent work has improved pose control an…

    Submitted 18 February, 2025; originally announced February 2025.

  30. arXiv:2502.06220  [pdf, other]

    cs.CV cs.IR

    FunduSAM: A Specialized Deep Learning Model for Enhanced Optic Disc and Cup Segmentation in Fundus Images

    Authors: Jinchen Yu, Yongwei Nie, Fei Qi, Wenxiong Liao, Hongmin Cai

    Abstract: The Segment Anything Model (SAM) has gained popularity as a versatile image segmentation method, thanks to its strong generalization capabilities across various domains. However, when applied to optic disc (OD) and optic cup (OC) segmentation tasks, SAM encounters challenges due to the complex structures, low contrast, and blurred boundaries typical of fundus images, leading to suboptimal performa…

    Submitted 10 February, 2025; originally announced February 2025.

  31. arXiv:2502.02196  [pdf, other]

    cs.CV cs.AI

    Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition

    Authors: Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei

    Abstract: In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language f…

    Submitted 4 February, 2025; originally announced February 2025.

    Comments: 3rd Place in Cross-View Isolated Sign Language Recognition Challenge at WWW 2025

  32. arXiv:2501.15579  [pdf, other]

    cs.CV cs.CL

    An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training

    Authors: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen

    Abstract: The clinical adoption of artificial intelligence (AI) in medical imaging requires models that are both diagnostically accurate and interpretable to clinicians. While current multimodal biomedical foundation models prioritize performance, their black-box nature hinders explaining the decision-making process in clinically meaningful concepts. Here, we present ConceptCLIP, the first explainable biome…

    Submitted 26 April, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

  33. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  34. arXiv:2412.09530  [pdf, other]

    cs.CV

    Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

    Authors: Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

    Abstract: The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficien…

    Submitted 12 December, 2024; originally announced December 2024.

  35. arXiv:2412.06482  [pdf, other]

    math.OC cs.GT eess.SY

    A Cardinality-Constrained Approach to Combinatorial Bilevel Congestion Pricing

    Authors: Lei Guo, Jiayang Li, Yu Marco Nie, Jun Xie

    Abstract: Combinatorial bilevel congestion pricing (CBCP), a variant of the mixed (continuous/discrete) network design problems, seeks to minimize the total travel time experienced by all travelers in a road network, by strategically selecting toll locations and determining toll charges. Conventional wisdom suggests that these problems are intractable since they have to be formulated and solved with a signi…

    Submitted 23 April, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  36. arXiv:2412.05734  [pdf, other]

    cs.CR cs.AI cs.LG

    PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

    Authors: Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song

    Abstract: Recent studies have discovered that LLMs have serious privacy leakage concerns, where an LLM may be fooled into outputting private information under carefully crafted adversarial prompts. These risks include leaking system prompts, personally identifiable information, training data, and model parameters. Most existing red-teaming approaches for privacy leakage rely on humans to craft the adversari…

    Submitted 7 December, 2024; originally announced December 2024.

  37. arXiv:2411.17949  [pdf, other]

    cs.CV

    ROICtrl: Boosting Instance Control for Visual Generation

    Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

    Abstract: Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box pai…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Project page at https://roictrl.github.io/

  38. arXiv:2411.11930  [pdf, other]

    cs.CV cs.AI

    AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

    Authors: Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang

    Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoni…

    Submitted 13 December, 2024; v1 submitted 18 November, 2024; originally announced November 2024.

  39. arXiv:2411.11543  [pdf, other]

    cs.CV cs.AI

    PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

    Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Jiaheng Liu, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

    Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content, launching harmful attacks. To address this challenge, we pr…

    Submitted 13 January, 2025; v1 submitted 18 November, 2024; originally announced November 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2405.13581

  40. arXiv:2411.03047  [pdf, other]

    cs.CV cs.GR

    GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

    Authors: Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han

    Abstract: Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecede…

    Submitted 5 November, 2024; originally announced November 2024.

    Comments: Project page: https://garverselod.github.io/

  41. arXiv:2410.19917  [pdf, other]

    cs.CR cs.IT cs.LG

    Collaborative Inference over Wireless Channels with Feature Differential Privacy

    Authors: Mohamed Seif, Yuqi Nie, Andrea J. Goldsmith, H. Vincent Poor

    Abstract: Collaborative inference among multiple wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for sensing and computer vision. This approach typically involves a three-stage process: a) data acquisition through sensing, b) feature extraction, and c) feature encoding for transmission. However, transmitting the extracted features pose…

    Submitted 25 October, 2024; originally announced October 2024.

    Comments: This work is under review for possible IEEE publication. arXiv admin note: substantial text overlap with arXiv:2406.00256

  42. arXiv:2410.11096  [pdf, other]

    cs.CR cs.AI

    SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

    Authors: Yu Yang, Yuzhou Nie, Zhun Wang, Yuheng Tang, Wenbo Guo, Bo Li, Dawn Song

    Abstract: Existing works have established multiple benchmarks to highlight the security risks associated with Code GenAI. These risks are primarily reflected in two areas: a model's potential to generate insecure code (insecure coding) and its utility in cyberattacks (cyberattack helpfulness). While these benchmarks have made significant strides, there remain opportunities for further improvement. For instanc…

    Submitted 14 October, 2024; originally announced October 2024.

  43. arXiv:2410.08858  [pdf, other]

    cs.CR cs.SE

    Decoding Secret Memorization in Code LLMs Through Token-Level Characterization

    Authors: Yuqing Nie, Chong Wang, Kailong Wang, Guoai Xu, Guosheng Xu, Haoyu Wang

    Abstract: Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, their training process inadvertently leads to the memorization of sensitive information, posing severe privacy risks. Existing studies on memorization in LLMs primarily rely on prompt engineering techniques, which suffer from limitations such as wide…

    Submitted 20 April, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: 13 pages, 8 figures

  44. arXiv:2410.07701  [pdf, other]

    cs.RO

    Autonomous Driving in Unstructured Environments: How Far Have We Come?

    Authors: Chen Min, Shubin Si, Xu Wang, Hanzhang Xue, Weizhong Jiang, Yang Liu, Juan Wang, Qingtian Zhu, Qi Zhu, Lun Luo, Fanjie Kong, Jinyu Miao, Xudong Cai, Shuai An, Wei Li, Jilin Mei, Tong Sun, Heng Zhai, Qifeng Liu, Fangzhou Zhao, Liang Chen, Shuai Wang, Erke Shang, Linzhi Shang, Kunlong Zhao, et al. (13 additional authors not shown)

    Abstract: Research on autonomous driving in unstructured outdoor environments is less advanced than in structured urban settings due to challenges like environmental diversities and scene complexity. These environments, such as rural areas and rugged terrains, pose unique obstacles that are not common in structured urban areas. Despite these difficulties, autonomous driving in unstructured outdoor environment…

    Submitted 31 October, 2024; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: Survey paper; 38 pages

  45. arXiv:2410.06886  [pdf, other]

    cs.CL

    FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding

    Authors: Jingyang Deng, Zhengyang Shen, Boyang Wang, Lixin Su, Suqi Cheng, Ying Nie, Junfeng Wang, Dawei Yin, Jinwen Ma

    Abstract: The development of Long-Context Large Language Models (LLMs) has markedly advanced natural language processing by facilitating the process of textual data across long documents and multiple corpora. However, Long-Context LLMs still face two critical challenges: The lost in the middle phenomenon, where crucial middle-context information is likely to be missed, and the distraction issue that the mod…

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: Accepted by the 27th European Conference on Artificial Intelligence (ECAI-2024), this is the full version of the paper including technical appendices. This final version features enhanced formatting and corrections to errors present in other online versions. We regret any inconvenience this may have caused our readers

  46. arXiv:2409.18569  [pdf, other]

    cs.CV

    Cross-video Identity Correlating for Person Re-identification Pre-training

    Authors: Jialong Zuo, Ying Nie, Hanyu Zhou, Huaxin Zhang, Haoyu Wang, Tianyu Guo, Nong Sang, Changxin Gao

    Abstract: Recent researches have proven that pre-training on large-scale person images extracted from internet videos is an effective way in learning better representations for person re-identification. However, these researches are mostly confined to pre-training at the instance-level or single-video tracklet-level. They ignore the identity-invariance in images of the same person across different videos, w…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: NeurIPS 2024 Accepted Paper

  47. arXiv:2409.16040  [pdf, other]

    cs.LG cs.AI

    Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

    Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin

    Abstract: Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a…

    Submitted 27 February, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted by the 13th International Conference on Learning Representations (ICLR 2025)

  48. arXiv:2408.10899  [pdf, other]

    cs.RO

    All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents

    Authors: Zhiqiang Wang, Hao Zheng, Yunshuang Nie, Wenjun Xu, Qingwei Wang, Hua Ye, Zhe Li, Kaidong Zhang, Xuewen Cheng, Wanxi Dong, Chang Cai, Liang Lin, Feng Zheng, Xiaodan Liang

    Abstract: Embodied AI is transforming how AI systems interact with the physical world, yet existing datasets are inadequate for developing versatile, general-purpose agents. These limitations include a lack of standardized formats, insufficient data diversity, and inadequate data volume. To address these issues, we introduce ARIO (All Robots In One), a new data standard that enhances existing datasets by of…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Project website: https://imaei.github.io/project_pages/ario/

  49. arXiv:2408.10006  [pdf, other]

    cs.LG

    Unlocking the Power of LSTM for Long Term Time Series Forecasting

    Authors: Yaxuan Kong, Zepu Wang, Yuqi Nie, Tian Zhou, Stefan Zohren, Yuxuan Liang, Peng Sun, Qingsong Wen

    Abstract: Traditional recurrent neural network architectures, such as long short-term memory neural networks (LSTM), have historically held a prominent role in time series forecasting (TSF) tasks. While the recently introduced sLSTM for Natural Language Processing (NLP) introduces exponential gating and memory mixing that are beneficial for long term sequential learning, its potential short memory issue is…

    Submitted 24 February, 2025; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: Accepted by 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

  50. arXiv:2408.09594  [pdf, other]

    cs.AI

    Moonshine: Distilling Game Content Generators into Steerable Generative Models

    Authors: Yuhe Nie, Michael Middleton, Tim Merino, Nidhushan Kanagaraja, Ashutosh Kumar, Zhan Zhuang, Julian Togelius

    Abstract: Procedural Content Generation via Machine Learning (PCGML) has enhanced game content creation, yet challenges in controllability and limited training data persist. This study addresses these issues by distilling a constructive PCG algorithm into a controllable PCGML model. We first generate a large amount of content with a constructive algorithm and label it using a Large Language Model (LLM). We…

    Submitted 2 February, 2025; v1 submitted 18 August, 2024; originally announced August 2024.

    ACM Class: I.2.1