Showing 1–50 of 112 results for author: Sheng, L

Searching in archive cs.
  1. arXiv:2506.24000  [pdf, ps, other]

    cs.LG cs.CV

    The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

    Authors: Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

    Abstract: Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Github link: https://github.com/TomSheng21/tta-vlm

  2. arXiv:2506.19851  [pdf, ps, other]

    cs.CV

    AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models

    Authors: Zehuan Huang, Haoran Feng, Yangtian Sun, Yuanchen Guo, Yanpei Cao, Lu Sheng

    Abstract: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowl…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Project page: https://anima-x.github.io/

  3. arXiv:2506.18315  [pdf, ps, other]

    cs.SE cs.AI

    Use Property-Based Testing to Bridge LLM Code Generation and Validation

    Authors: Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, Lu Sheng

    Abstract: Large Language Models (LLMs) excel at code generation, but ensuring their outputs to be functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, includi…

    Submitted 23 June, 2025; originally announced June 2025.

  4. arXiv:2506.16402  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.RO

    IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

    Authors: Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

    Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that…

    Submitted 19 June, 2025; originally announced June 2025.

  5. arXiv:2506.09740  [pdf, ps, other]

    cs.CV cs.AI

    ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

    Authors: Qin Zhou, Zhiyang Zhang, Jinglong Wang, Xiaobin Li, Jing Zhang, Qian Yu, Lu Sheng, Dong Xu

    Abstract: Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely…

    Submitted 11 June, 2025; originally announced June 2025.

  6. arXiv:2506.08390  [pdf, ps, other]

    cs.AI

    On Reasoning Strength Planning in Large Reasoning Models

    Authors: Leheng Sheng, An Zhang, Zijian Wu, Weixiang Zhao, Changshuo Shen, Yi Zhang, Xiang Wang, Tat-Seng Chua

    Abstract: Recent studies empirically reveal that large reasoning models (LRMs) can automatically allocate more reasoning strengths (i.e., the number of reasoning tokens) for harder problems, exhibiting difficulty-awareness for better task performance. While this automatic reasoning strength allocation phenomenon has been widely observed, its underlying mechanism remains largely unexplored. To this end, we p…

    Submitted 9 June, 2025; originally announced June 2025.

  7. arXiv:2506.07022  [pdf, ps, other]

    cs.LG cs.AI cs.CR

    AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

    Authors: Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

    Abstract: As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which will further induce the refusal…

    Submitted 8 June, 2025; originally announced June 2025.

  8. arXiv:2506.04308  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

    Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang

    Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Project page: https://zhoues.github.io/RoboRefer/

  9. arXiv:2505.19623  [pdf, other]

    cs.IR cs.AI

    AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

    Authors: Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

    Abstract: The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from c…

    Submitted 28 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 15 pages, 6 figures

  10. arXiv:2505.13271  [pdf, ps, other]

    cs.CL

    CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

    Authors: Lei Sheng, Shuai-Shuai Xu

    Abstract: Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency ma…

    Submitted 30 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: 25 pages, 5 figures

  11. arXiv:2504.16084  [pdf, ps, other]

    cs.CL cs.LG

    TTRL: Test-Time Reinforcement Learning

    Authors: Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou

    Abstract: This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly…

    Submitted 30 June, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  12. arXiv:2504.11195  [pdf, other]

    cs.LG cs.CR cs.CV

    R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

    Authors: Lijun Sheng, Jian Liang, Zilei Wang, Ran He

    Abstract: Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models.…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: CVPR 2025

  13. arXiv:2503.23271  [pdf, other]

    cs.RO cs.AI

    Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models

    Authors: Haonan Chen, Jiaming Xu, Lily Sheng, Tianchen Ji, Shuijing Liu, Yunzhu Li, Katherine Driggs-Campbell

    Abstract: When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the p…

    Submitted 29 March, 2025; originally announced March 2025.

    Comments: Project Page: https://haonan16.github.io/coord_bimanual_page/. 12 pages, 12 figures, Accepted at ICRA 2025

  14. arXiv:2503.12590  [pdf, other]

    cs.CV

    Personalize Anything for Free with Diffusion Transformer

    Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng

    Abstract: Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibit higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply rep…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: https://fenghora.github.io/Personalize-Anything-Page/

  15. arXiv:2503.05127  [pdf, other]

    cs.CV cs.AI

    HexPlane Representation for 3D Semantic Scene Understanding

    Authors: Zeren Chen, Yuenan Hou, Yulin Chen, Li Liu, Xiao Sun, Lu Sheng

    Abstract: In this paper, we introduce the HexPlane representation for 3D semantic scene understanding. Specifically, we first design the View Projection Module (VPM) to project the 3D point cloud into six planes to maximally retain the original spatial information. Features of six planes are extracted by the 2D encoder and sent to the HexPlane Association Module (HAM) to adaptively fuse the most informative…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: 7 pages, 2 figures

  16. arXiv:2503.00292  [pdf, other]

    stat.ML cs.LG

    Generalization Bounds for Equivariant Networks on Markov Data

    Authors: Hui Li, Zhiguo Wang, Bohui Chen, Li Sheng

    Abstract: Equivariant neural networks play a pivotal role in analyzing datasets with symmetry properties, particularly in complex data structures. However, integrating equivariance with Markov properties presents notable challenges due to the inherent dependencies within such data. Previous research has primarily concentrated on establishing generalization bounds under the assumption of independently and id…

    Submitted 28 February, 2025; originally announced March 2025.

    Comments: Submitted for possible publication

  17. arXiv:2502.12277  [pdf]

    cs.LG cs.CY

    Healthcare cost prediction for heterogeneous patient profiles using deep learning models with administrative claims data

    Authors: Mohammad Amin Morid, Olivia R. Liu Sheng

    Abstract: Problem: How can we design patient cost prediction models that effectively address the challenges of heterogeneity in administrative claims (AC) data to ensure accurate, fair, and generalizable predictions, especially for high-need (HN) patients with complex chronic conditions? Relevance: Accurate and equitable patient cost predictions are vital for developing health management policies and opti…

    Submitted 17 February, 2025; originally announced February 2025.

    Journal ref: Information Systems Research (forthcoming 2025)

  18. arXiv:2502.10739  [pdf, other]

    cs.CL

    BASE-SQL: A powerful open source Text-To-SQL baseline approach

    Authors: Lei Sheng, Shuai-Shuai Xu, Wei Xie

    Abstract: The conversion of natural language into SQL language for querying databases (Text-to-SQL) has broad application prospects and has attracted widespread attention. At present, the mainstream Text-to-SQL methods are mainly divided into in-context learning (ICL) based methods and supervised fine-tuning (SFT) based methods. ICL-based methods can achieve relatively good results thanks to the use of the…

    Submitted 15 February, 2025; originally announced February 2025.

    Comments: Work in progress. 16 pages, 3 figures, 8 tables

  19. arXiv:2502.04153  [pdf, other]

    cs.CL cs.AI

    UltraIF: Advancing Instruction Following from the Wild

    Authors: Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang

    Abstract: Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions…

    Submitted 6 February, 2025; originally announced February 2025.

  20. arXiv:2501.12612  [pdf, other]

    cs.CL cs.CR

    T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

    Authors: Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao

    Abstract: Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific s…

    Submitted 20 February, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

  21. arXiv:2501.06465  [pdf, other]

    cs.CL cs.AI

    MedCT: A Clinical Terminology Graph for Generative AI Applications in Healthcare

    Authors: Ye Chen, Dongdong Huang, Haoyun Xu, Cong Fu, Lin Sheng, Qingli Zhou, Yuqiang Shen, Kai Wang

    Abstract: We introduce the world's first clinical terminology for the Chinese healthcare community, namely MedCT, accompanied by a clinical foundation model MedBERT and an entity linking model MedLink. The MedCT system enables standardized and programmable representation of Chinese clinical data, successively stimulating the development of new medicines, treatment pathways, and better patient outcomes for t…

    Submitted 10 April, 2025; v1 submitted 11 January, 2025; originally announced January 2025.

    Comments: Accepted into ICCS 2025 and published in Springer's LNCS Series

  22. arXiv:2501.02429  [pdf, other]

    cs.IR

    Citation Structural Diversity: A Novel and Concise Metric Combining Structure and Semantics for Literature Evaluation

    Authors: Mingyue Kong, Yinglong Zhang, Likun Sheng, Kaifeng Hong

    Abstract: As academic research becomes increasingly diverse, traditional literature evaluation methods face significant limitations, particularly in capturing the complexity of academic dissemination and the multidimensional impacts of literature. To address these challenges, this paper introduces a novel literature evaluation model of citation structural diversity, with a focus on assessing its feasibility…

    Submitted 4 January, 2025; originally announced January 2025.

    Comments: 18 pages, 10 figures

  23. arXiv:2412.20895  [pdf, other]

    cs.CV cs.LG

    Towards Compatible Fine-tuning for Vision-Language Model Updates

    Authors: Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan

    Abstract: So far, efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, are these plug-and-play modules still effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on…

    Submitted 30 December, 2024; originally announced December 2024.

    Comments: preprint

  24. arXiv:2412.20670  [pdf, other]

    cs.LG cs.CV

    Prototypical Distillation and Debiased Tuning for Black-box Unsupervised Domain Adaptation

    Authors: Jian Liang, Lijun Sheng, Hongmin Liu, Ran He

    Abstract: Unsupervised domain adaptation aims to transfer knowledge from a related, label-rich source domain to an unlabeled target domain, thereby circumventing the high costs associated with manual annotation. Recently, there has been growing interest in source-free domain adaptation, a paradigm in which only a pre-trained model, rather than the labeled source data, is provided to the target domain. Given…

    Submitted 29 December, 2024; originally announced December 2024.

  25. arXiv:2412.04455  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

    Authors: Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang

    Abstract: Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failu…

    Submitted 21 March, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: Accepted by CVPR 2025. Project page: https://zhoues.github.io/Code-as-Monitor/

  26. arXiv:2412.03632  [pdf, other]

    cs.CV

    MV-Adapter: Multi-view Consistent Image Generation Made Easy

    Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng

    Abstract: Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter…

    Submitted 4 December, 2024; originally announced December 2024.

    Comments: Project page: https://huanngzh.github.io/MV-Adapter-Page/

  27. arXiv:2412.03558  [pdf, other]

    cs.CV

    MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

    Authors: Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng

    Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple…

    Submitted 27 May, 2025; v1 submitted 4 December, 2024; originally announced December 2024.

    Comments: Project page: https://huanngzh.github.io/MIDI-Page/

  28. arXiv:2411.17265  [pdf, ps, other]

    cs.CL cs.CV

    Systematic Reward Gap Optimization for Mitigating VLM Hallucinations

    Authors: Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, Lu Sheng

    Abstract: The success of Direct Preference Optimization (DPO) in mitigating hallucinations in Vision Language Models (VLMs) critically hinges on the true reward gaps within preference pairs. However, current methods, typically relying on ranking or rewriting strategies, often struggle to optimize these reward gaps in a systematic way during data curation. A core difficulty lies in precisely characterizing a…

    Submitted 23 June, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

  29. arXiv:2410.18072  [pdf, other]

    cs.CV

    WorldSimBench: Towards Video Generation Models as World Simulators

    Authors: Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, Ruimao Zhang

    Abstract: Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from…

    Submitted 23 October, 2024; originally announced October 2024.

  30. arXiv:2409.11884  [pdf, ps, other]

    cs.LG

    Out-of-Distribution Detection: A Task-Oriented Survey of Recent Advances

    Authors: Shuo Lu, Yingsheng Wang, Lijun Sheng, Lingxiao He, Aihua Zheng, Jian Liang

    Abstract: Out-of-distribution (OOD) detection aims to detect test samples outside the training category space, which is an essential component in building reliable machine learning systems. Existing reviews on OOD detection primarily focus on method taxonomy, surveying the field by categorizing various approaches. However, many recent works concentrate on non-traditional OOD detection scenarios, such as tes…

    Submitted 18 June, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: First Submitted in May 2024

  31. arXiv:2409.05125  [pdf, other]

    cs.CV

    PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

    Authors: Lei Sheng, Shuai-Shuai Xu

    Abstract: Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, Plumb a PDF (pdfnumber), and Paddle Paddle Stru…

    Submitted 8 September, 2024; originally announced September 2024.

    Comments: 19 pages, 4 figures

  32. arXiv:2409.05105  [pdf, other]

    cs.CL cs.AI

    EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction

    Authors: Lei Sheng, Shuai-Shuai Xu

    Abstract: Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress, they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to exis…

    Submitted 8 September, 2024; originally announced September 2024.

    Comments: 18 pages, 2 figures

  33. arXiv:2408.10159  [pdf, other]

    cs.IR cs.AI

    Customizing Language Models with Instance-wise LoRA for Sequential Recommendation

    Authors: Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, Xiangnan He

    Abstract: Sequential recommendation systems predict the next interaction item based on users' past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches are eager to apply LLMs to sequential recommendation. A common paradigm is converting user behavior sequences into instruction…

    Submitted 20 January, 2025; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: NeurIPS 2024 poster

    Journal ref: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

  34. arXiv:2407.15773  [pdf, other]

    cs.LG cs.CV

    STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay

    Authors: Yongcan Yu, Lijun Sheng, Ran He, Jian Liang

    Abstract: Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during the open-world inference process, there are inevitably test data instances from unknown classes, commo…

    Submitted 27 August, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024; Fixed a bug in calculating OOD score of STAMP and updated the results

  35. arXiv:2407.15734  [pdf, other]

    cs.AI cs.MA

    TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON

    Authors: John Chong Min Tan, Prince Saroj, Bharat Runwal, Hardik Maheshwari, Brian Lim Yi Sheng, Richard Cottrill, Alankrit Chona, Ambuj Kumar, Mehul Motani

    Abstract: TaskGen is an open-sourced agentic framework which uses an Agent to solve an arbitrary task by breaking them down into subtasks. Each subtask is mapped to an Equipped Function or another Agent to execute. In order to reduce verbosity (and hence token usage), TaskGen uses StrictJSON that ensures JSON output from the Large Language Model (LLM), along with additional features such as type checking an…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 53 pages

  36. arXiv:2407.05441  [pdf, other]

    cs.IR cs.AI

    Language Representations Can be What Recommenders Need: Findings and Potentials

    Authors: Leheng Sheng, An Zhang, Yi Zhang, Yuxin Chen, Xiang Wang, Tat-Seng Chua

    Abstract: Recent studies empirically indicate that language models (LMs) encode rich world knowledge beyond mere semantics, attracting significant attention across various fields. However, in the recommendation domain, it remains uncertain whether LMs implicitly encode user preference information. Contrary to prevailing understanding that LMs and traditional recommenders learn two distinct representation sp…

    Submitted 20 April, 2025; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: ICLR 2025 (Oral). Codes are available at https://github.com/LehengTHU/AlphaRec

  37. arXiv:2406.09215  [pdf, other]

    cs.IR cs.AI

    On Softmax Direct Preference Optimization for Recommendation

    Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua

    Abstract: Recommender systems aim to predict personalized rankings based on user preference data. With the rise of Language Models (LMs), LM-based recommenders have been widely explored due to their extensive world knowledge and powerful reasoning abilities. Most of the LM-based recommenders convert historical interactions into language prompts, pairing with a positive item as the target response and fine-t…

    Submitted 7 November, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024

  38. arXiv:2406.03184  [pdf, other]

    cs.CV

    Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

    Authors: Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Lu Sheng

    Abstract: Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which in…

    Submitted 1 May, 2025; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: See our project page at https://costwen.github.io/Ouroboros3D/

  39. arXiv:2404.15267  [pdf, other]

    cs.CV

    From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

    Authors: Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng

    Abstract: Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, inc…

    Submitted 23 April, 2024; originally announced April 2024.

  40. arXiv:2404.13854  [pdf, other]

    cs.CV

    Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

    Authors: Haolin Yang, Chaoqiang Zhao, Lu Sheng, Yang Tang

    Abstract: Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in the videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision of…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted by IJCAI2024

  41. arXiv:2403.19622  [pdf, other]

    cs.RO cs.CV

    RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

    Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

    Abstract: Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress of Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks, by decomposing the compounded tasks as a plan of sequentially executing primitive-level skills that have been already mastered. It i…

    Submitted 1 February, 2025; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: 18 pages, 11 figures, 7 tables. Accepted by NeurIPS 2024 Workshop

  42. arXiv:2403.17830  [pdf, other]

    cs.CV

    Assessment of Multimodal Large Language Models in Alignment with Human Values

    Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

    Abstract: Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining h…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.02692

  43. arXiv:2403.12037  [pdf, other]

    cs.CV

    MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

    Authors: Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

    Abstract: It is a long-lasting goal to design a generalist-embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to steadily follow instructions due to difficulties in understanding abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator…

    Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://sites.google.com/view/minedreamer/main

  44. arXiv:2403.10750  [pdf, other]

    cs.CL cs.AI

    Depression Detection on Social Media with Large Language Models

    Authors: Xiaochong Lan, Yiming Cheng, Li Sheng, Chen Gao, Yong Li

    Abstract: Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces…

    Submitted 15 March, 2024; originally announced March 2024.

  45. arXiv:2403.10261  [pdf, other]

    cs.CV

    Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection

    Authors: Yuting Xu, Jian Liang, Lijun Sheng, Xiao-Yu Zhang

    Abstract: The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but incur high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (…

    Submitted 20 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by IJCV

  46. arXiv:2403.03962  [pdf, other]

    cs.SI cs.AI cs.NE

    Identify Critical Nodes in Complex Network with Large Language Models

    Authors: Jinzhu Mao, Dongyun Zou, Li Sheng, Siyi Liu, Chen Gao, Yue Wang, Yong Li

    Abstract: Identifying critical nodes in networks is a classical decision-making task, and many methods struggle to strike a balance between adaptability and utility. Therefore, we propose an approach that empowers Evolutionary Algorithm (EA) with Large Language Models (LLMs), to generate a function called "score_nodes" which can further be used to identify crucial nodes based on their assigned scores. Our…

    Submitted 1 March, 2024; originally announced March 2024.

  47. arXiv:2402.13769  [pdf, other]

    cs.IR

    General Debiasing for Graph-based Collaborative Filtering via Adversarial Graph Dropout

    Authors: An Zhang, Wenchang Ma, Pengbo Wei, Leheng Sheng, Xiang Wang

    Abstract: Graph neural networks (GNNs) have shown impressive performance in recommender systems, particularly in collaborative filtering (CF). The key lies in aggregating neighborhood information on a user-item interaction graph to enhance user/item representations. However, we have discovered that this aggregation mechanism comes with a drawback, which amplifies biases present in the interaction graph. For…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: Accepted to WWW 2024

  48. arXiv:2402.04087  [pdf, other]

    cs.CV cs.AI cs.LG

    A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

    Authors: Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan

    Abstract: Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapter, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with lim…

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: Accepted by ICLR 2024

  49. arXiv:2401.15657  [pdf, other]

    cs.CV

    Data-Free Generalized Zero-Shot Learning

    Authors: Bowen Tang, Long Yan, Jing Zhang, Qian Yu, Lu Sheng, Dong Xu

    Abstract: Deep learning models have the ability to extract rich knowledge from large-scale datasets. However, the sharing of data has become increasingly challenging due to concerns regarding data copyright and privacy. Consequently, this hampers the effective transfer of knowledge from existing data to novel downstream tasks and concepts. Zero-shot learning (ZSL) approaches aim to recognize new classes by…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted by AAAI24

  50. arXiv:2401.15071  [pdf, other]

    cs.CV

    From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

    Authors: Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, et al. (11 additional authors not shown)

    Abstract: Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectation of the broad public, even though the most powerful OpenAI's GPT-4 and Google's Gemini have been deployed. This paper strives to enhance unde…

    Submitted 29 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.